When I started learning data analysis a few years ago, the first things I learned were SQL and Pandas. A strong foundation in both is crucial for a data analyst: they are powerful tools that help analysts efficiently analyze and manipulate data stored in databases.
Overview of SQL and Pandas
SQL (Structured Query Language) is a programming language used to manage and manipulate relational databases. On the other hand, Pandas is a Python library used for data manipulation and analysis. Data analysis involves working with large amounts of data, and databases are often used to store this data. SQL and Pandas provide powerful tools for working with databases, allowing data analysts to efficiently extract, manipulate, and analyze data. By leveraging these tools, data analysts can gain valuable insights from data that would otherwise be difficult to obtain.
In this article, we will explore how to use SQL and Pandas to read and write to a database.
Connecting to the DB
Installing the Libraries
Before we can connect to a SQL database with Pandas, we must first install the necessary libraries. The two main libraries required are Pandas and SQLAlchemy. Pandas, as mentioned in the introduction, is a popular data manipulation library built around tabular data structures, while SQLAlchemy provides an API for connecting to and interacting with SQL databases. We can install both libraries with Python's package manager, pip, by running the following commands at the command prompt.
$ pip install pandas
$ pip install sqlalchemy
Making the Connection
With the libraries installed, we can now use Pandas to connect to the SQL database. To begin, we will create a SQLAlchemy engine object with create_engine(). The create_engine() function connects the Python code to the database. It takes as an argument a connection string that specifies the database type and connection details. In this example, we’ll use the SQLite database type and the database file’s path. Create an engine object for a SQLite database using the example below:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///C:/SQLite/student.db')
If the SQLite database file, student.db in our case, is in the same directory as the Python script, we can use the file name directly, as shown below.
engine = create_engine('sqlite:///student.db')
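The examples in the rest of this article query a Student table inside student.db. If you want to follow along, a minimal sketch like the one below can create that database. Note that the schema and sample names are inferred from the outputs shown later in this article, and the example.com email addresses are placeholders, not the original data.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed schema, reconstructed from the query outputs later in this article.
engine = create_engine('sqlite:///student.db')

students = pd.DataFrame({
    'rollNumber': [1, 2, 3],
    'firstName': ['Mark', 'Peter', 'Meg'],
    'lastName': ['Simson', 'Griffen', 'Aniston'],
    # Placeholder addresses -- the article's originals are redacted.
    'email': ['mark@example.com', 'peter@example.com', 'meg@example.com'],
    'dateOfBirth': ['2000-02-23', '2001-04-15', '2001-09-20'],
})

# to_sql() writes the DataFrame into the database as a table,
# replacing it if it already exists.
students.to_sql('Student', engine, if_exists='replace', index=False)
engine.dispose()
```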
Reading SQL Files with Pandas
Let’s read data now that we’ve established a connection. In this section, we will look at the read_sql, read_sql_table, and read_sql_query functions and how to use them to work with a database.
Executing SQL Queries using Pandas’ read_sql() Function
read_sql() is a Pandas function that executes an SQL query and loads the result into a Pandas DataFrame, bridging SQL and Python so we can take advantage of both languages. It wraps read_sql_table() and read_sql_query(): the call is routed internally based on the input, so an SQL query is sent to read_sql_query(), while a database table name is sent to read_sql_table(). The read_sql() syntax is as follows:
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
The sql and con parameters are required; the rest are optional, but they let us shape the result. Let’s take a closer look at each parameter.
– sql: SQL query or database table name
– con: Connection object or connection URL
– index_col: This parameter allows us to use one or more columns from the SQL query result as a data frame index. It can take either a single column or a list of columns.
– coerce_float: This parameter specifies whether non-string, non-numeric values (for example, decimal.Decimal) should be converted to floating-point numbers. It defaults to True.
– params: Provides a secure way to pass dynamic values into the SQL query, guarding against SQL injection. It accepts a dictionary, tuple, or list; the placeholder syntax varies by database driver.
– parse_dates: This allows us to specify which column in the resulting dataframe will be interpreted as a date. It accepts a single column, a list of columns, or a dictionary with the key as the column name and the value as the column format.
– columns: This allows us to fetch only the listed columns (it applies when sql is a table name rather than a query).
– chunksize: Important when working with a large data set; it retrieves the query result in smaller chunks, keeping memory usage down.
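As a quick illustration of params, the sketch below builds a throwaway in-memory SQLite table (an assumed, cut-down version of the Student table) and binds a value through a named placeholder instead of string formatting:

```python
import pandas as pd
from sqlalchemy import create_engine

# Throwaway in-memory stand-in for the Student table (assumed schema).
engine = create_engine('sqlite:///:memory:')
pd.DataFrame({
    'rollNumber': [1, 2, 3],
    'firstName': ['Mark', 'Peter', 'Meg'],
}).to_sql('Student', engine, index=False)

# Binding the value through params avoids SQL injection. With a SQLAlchemy
# engine, named placeholders (:name) plus a dict are the portable style.
df = pd.read_sql(
    'SELECT firstName FROM Student WHERE rollNumber = :roll',
    engine,
    params={'roll': 2},
)
print(df)
engine.dispose()
```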
Here’s an example of how to use read_sql():
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///C:/SQLite/student.db')

df = pd.read_sql(
    "SELECT * FROM Student",
    engine,
    index_col='rollNumber',
    parse_dates='dateOfBirth',
)
print(df)
print("The data type of dateOfBirth: ", df.dateOfBirth.dtype)

engine.dispose()
Output:
           firstName lastName              email dateOfBirth
rollNumber
1               Mark   Simson  [email protected]  2000-02-23
2              Peter  Griffen  [email protected]  2001-04-15
3                Meg  Aniston  [email protected]  2001-09-20
The data type of dateOfBirth:  datetime64[ns]
After connecting to the database, we execute a query that returns all records from the Student table and store them in the DataFrame df. The rollNumber column becomes the DataFrame index via the index_col parameter, and dateOfBirth has the datetime64[ns] dtype thanks to parse_dates. read_sql() is a generic function: because it passes its SQL through to the database, some drivers will also accept statements such as INSERT, DELETE, and UPDATE, although the function is designed for retrieving data.
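parse_dates also accepts the dictionary form mentioned earlier. The following sketch, run against an assumed in-memory copy of the table, passes an explicit format string for dateOfBirth:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory stand-in for student.db (assumed schema), just to illustrate
# the {column: format} form of parse_dates.
engine = create_engine('sqlite:///:memory:')
pd.DataFrame({
    'rollNumber': [1, 2],
    'firstName': ['Mark', 'Peter'],
    'dateOfBirth': ['2000-02-23', '2001-04-15'],
}).to_sql('Student', engine, index=False)

df = pd.read_sql(
    'SELECT * FROM Student',
    engine,
    index_col='rollNumber',
    parse_dates={'dateOfBirth': '%Y-%m-%d'},  # column -> expected format
)
print(df.index.name)            # rollNumber
print(df['dateOfBirth'].dtype)  # datetime64[ns]
engine.dispose()
```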
Loading Specific Tables or Views from the DB
Loading a specific table or view with Pandas read_sql_table() is another technique to read data from the database into a Pandas dataframe.
What is read_sql_table?
The Pandas library provides the read_sql_table function, which is specifically designed to read an entire SQL table without executing any queries and return the result as a Pandas dataframe.
The syntax of read_sql_table() is as below:
pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)
Except for table_name and schema, the parameters have the same meanings as in read_sql().
– table_name: The name of the SQL table in the database.
– schema: Optional; the name of the schema that contains the table.
After creating a connection to the database, we will use the read_sql_table function to load the Student table into a Pandas DataFrame.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///C:/SQLite/student.db')

df = pd.read_sql_table('Student', engine)
print(df.head())

engine.dispose()
Output:
rollNumber firstName lastName email dateOfBirth
0 1 Mark Simson [email protected] 2000-02-23
1 2 Peter Griffen [email protected] 2001-04-15
2 3 Meg Aniston [email protected] 2001-09-20
Now suppose Student were a large table that would be memory-intensive to load all at once. Let’s explore how the chunksize parameter addresses this issue.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///C:/SQLite/student.db')

df_iterator = pd.read_sql_table('Student', engine, chunksize=1)
for df in df_iterator:
    print(df.head())

engine.dispose()
Output:
rollNumber firstName lastName email dateOfBirth
0 1 Mark Simson [email protected] 2000-02-23
0 2 Peter Griffen [email protected] 2001-04-15
0 3 Meg Aniston [email protected] 2001-09-20
Keep in mind that the chunksize here is 1 only because my table has just 3 records.
Querying the DB Directly with Pandas’ SQL Syntax
Extracting insights from the database is a core task for data analysts and scientists. To do so, we will leverage the read_sql_query() function.
What is read_sql_query()?
Using Pandas’ read_sql_query() function, we can run SQL queries and get the results directly into a DataFrame. The read_sql_query() function is created specifically for SELECT statements. It cannot be used for any other operations, such as DELETE or UPDATE.
Syntax:
pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None, dtype_backend=_NoDefault.no_default)
All parameter descriptions are the same as the read_sql() function.
Here’s an example of read_sql_query():
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///C:/SQLite/student.db')

df = pd.read_sql_query(
    'SELECT firstName, lastName FROM Student WHERE rollNumber = 1',
    engine,
)
print(df)

engine.dispose()
Output:
firstName lastName
0 Mark Simson
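Because read_sql_query() accepts arbitrary SELECT statements, we can also push aggregation work to the database and pull back only the summary rows. Here is a sketch, again run against an assumed in-memory stand-in for the Student table:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory stand-in for student.db (assumed schema). Aggregating in SQL
# lets the database do the heavy lifting and returns only summary rows.
engine = create_engine('sqlite:///:memory:')
pd.DataFrame({
    'rollNumber': [1, 2, 3],
    'dateOfBirth': ['2000-02-23', '2001-04-15', '2001-09-20'],
}).to_sql('Student', engine, index=False)

# Count students per birth year, grouped and ordered by the database.
df = pd.read_sql_query(
    "SELECT strftime('%Y', dateOfBirth) AS birthYear, "
    "COUNT(*) AS students "
    "FROM Student GROUP BY birthYear ORDER BY birthYear",
    engine,
)
print(df)
engine.dispose()
```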
Writing SQL Files with Pandas
While analyzing data, suppose we discover that a few entries need to be modified, or that a new table or view containing the data is required. One way to update or insert a new record is to use read_sql() and write a query. However, that method