Introduction to Pandas
Pandas is one of the most widely used Python libraries for data analysis and manipulation. Its name is derived from the term “Panel Data”, which refers to multidimensional structured datasets. It provides powerful and flexible tools to efficiently manipulate, analyze, and prepare data for computations, visualizations, or machine learning tasks.
Key features of Pandas include:
- Tabular data structures called DataFrame, analogous to Excel or SQL tables.
- Flexible Series for one-dimensional data.
- Powerful built-in operations for filtering, grouping, transforming, and merging datasets.
- Extensive support for handling missing data.
- Compatibility with a variety of data formats, including CSV, Excel, JSON, SQL, and more.
Pandas is open-source and is built on top of NumPy, so most column operations run as vectorized array computations rather than Python loops. It has become the go-to library for data science, machine learning, and general-purpose data processing tasks.
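To make the features listed above concrete, here is a minimal sketch. The sample data and the names `city`, `sales`, and `per_city` are made up for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Pune"],
        "sales": [120.0, 95.5, np.nan, 40.0],
    }
)

# A single column of a DataFrame is a Series.
sales = df["sales"]

# Aggregations skip missing values (NaN) by default.
total = sales.sum()

# Group rows by city and sum the sales within each group.
per_city = df.groupby("city")["sales"].sum()

# A numeric column can be exposed as a NumPy array.
values = sales.to_numpy()

print(type(sales).__name__)  # Series
print(total)                 # 255.5
print(per_city)
```

The same DataFrame is reused for filtering, grouping, and missing-data handling here only to keep the sketch short; real workflows usually load the data from a file or database first.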
Pandas API Explanation with Code Snippets
Below is an extensive list of commonly used Pandas API functions with examples. This aims to introduce you to over 50 commands that you can use to manipulate and analyze your data.
1. Loading Data
pd.read_csv()
Loads data from a CSV file into a DataFrame.
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
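If no "data.csv" file is at hand, the same call can be tried against an in-memory buffer, since pd.read_csv() also accepts file-like objects. The CSV text below is a hypothetical stand-in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,age\nAda,36\nLinus,28\n"

# read_csv accepts a file-like object as well as a path.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```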
pd.read_excel()
Loads data from an Excel file.
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
print(df.head())
pd.read_json()
Loads data from a JSON file.
df = pd.read_json("data.json")
print(df.head())
pd.read_sql()
Loads data from a SQL database query.
import sqlite3
connection = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM table_name;", connection)
print(df.head())
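The query example can be tried without an existing database file by using an in-memory SQLite database. This is a sketch only: the `people` table and its rows are invented for illustration, and the `params` argument is shown as one way to pass values to the query instead of formatting them into the SQL string.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a temporary database that exists only in RAM.
connection = sqlite3.connect(":memory:")
try:
    # Hypothetical table and rows, created just for this example.
    connection.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    connection.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [("Ada", 36), ("Linus", 28)],
    )
    # Bind the threshold as a query parameter rather than building the string.
    df = pd.read_sql(
        "SELECT name, age FROM people WHERE age > ?",
        connection,
        params=(30,),
    )
finally:
    # Release the connection even if the query fails.
    connection.close()

print(df)
```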
2. Inspecting Data
df.head()
Displays the first 5 rows of the dataset (use df.head(n) for the first n rows).
print(df.head())
df.tail()
Displays the last 5 rows of the dataset.
print(df.tail())
df.info()
Gives an overview of the DataFrame, including datatypes, column names, and non-null counts.
df.info()  # prints the summary itself and returns None, so wrapping it in print() would also print "None"
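Because df.info() writes its report directly to standard output and returns None, capturing the text requires passing a buffer through the `buf` parameter. A small sketch, using a made-up two-column DataFrame:

```python
import io

import pandas as pd

# Hypothetical data; column "a" contains one missing value.
df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})

# info() prints the report and returns None.
result = df.info()

# To keep the report as a string, write it into a text buffer instead.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

print(result is None)  # True
```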
df.describe()
Provides summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for numeric columns by default.
print(df.describe())
df.shape
Returns the dimensions of the DataFrame as a tuple (rows, columns).
print(df.shape)
df.columns
Returns the column labels of the DataFrame as an Index object (use df.columns.tolist() for a plain Python list).
print(df.columns)
df.index
Returns the index (row labels) of the DataFrame.
print(df.index)
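The inspection attributes above can be checked on a small DataFrame whose contents are known in advance. The data and the row labels "x", "y", "z" are invented for this sketch.

```python
import pandas as pd

# Hypothetical data with explicit row labels.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)

# shape is a (rows, columns) tuple, so it can be unpacked.
n_rows, n_cols = df.shape

# columns and index are Index objects; tolist() gives plain lists.
column_names = df.columns.tolist()
row_labels = df.index.tolist()

# describe() returns a DataFrame indexed by statistic name.
stats = df.describe()

print(n_rows, n_cols)          # 3 2
print(column_names)            # ['a', 'b']
print(row_labels)              # ['x', 'y', 'z']
print(stats.loc["mean", "a"])  # 2.0
```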
3. Selection and Indexing
df["column_name"]
Selects a single column.
print(df["column_name"])
df[["col1", "col2"]]
Selects multiple columns.
print(df[["col1", "col2"]])
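One detail worth seeing in code is the return type: single brackets with one name give a Series, while double brackets give a DataFrame even when only one column is listed. The column names below reuse the placeholders from the snippets above, and the values are made up.

```python
import pandas as pd

# Hypothetical data using the placeholder column names from the text.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# A single label returns a one-dimensional Series.
one_column = df["col1"]

# A list of labels returns a DataFrame, even with a single entry.
one_column_frame = df[["col1"]]
two_columns = df[["col1", "col2"]]

print(type(one_column).__name__)        # Series
print(type(one_column_frame).__name__)  # DataFrame
print(two_columns.shape)                # (2, 2)
```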
df.iloc[]
Selects rows and columns using integer-based indexing.
# Select the first row
print(df.iloc[0])
# Select rows 0 to 2 and columns 1 to 3
print(df.iloc[0:3, 1:4])
df.loc[]
Selects rows and columns using label-based indexing.
# Select rows where column 'A' equals 10
print(df.loc[df['A'] == 10])
# Select specific rows and columns by labels (label slices include the end label, so 0:5 covers labels 0 through 5)
print(df.loc[0:5, ["column1", "column2"]])
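The difference between position-based and label-based selection is easiest to see when the row labels are not the default integers. In this sketch the labels "w" through "z" and the data are invented for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"A": [10, 20, 10, 30], "B": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# iloc slices by position and excludes the end position.
by_position = df.iloc[0:2]

# loc slices by label and includes the end label.
by_label = df.loc["w":"y"]

# loc also accepts a boolean mask for rows plus a list of column labels.
matches = df.loc[df["A"] == 10, ["B"]]

print(by_position.index.tolist())  # ['w', 'x']
print(by_label.index.tolist())     # ['w', 'x', 'y']
print(matches["B"].tolist())       # [1, 3]
```

With the default integer index the two look alike, which is why df.loc[0:5] returns six rows while df.iloc[0:5] returns five.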
df.at[] / df.iat[]
Access a single value. df.at uses labels, while df.iat uses integer positions.
# Access using labels
print(df.at[0, 'column_name'])
# Access using index positions
print(df.iat[0, 1])
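A short sketch of both accessors on a DataFrame with string row labels, where the label and the position of a cell clearly differ. The fruit data is hypothetical; the example also shows that df.at can assign to a single cell.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"price": [1.5, 2.5], "qty": [3, 4]},
    index=["apple", "pear"],
)

# at takes a row label and a column label.
by_label = df.at["pear", "price"]

# iat takes a row position and a column position; this is the same cell.
by_position = df.iat[1, 0]

# Both accessors can also set a single value in place.
df.at["apple", "qty"] = 10

print(by_label)     # 2.5
print(by_position)  # 2.5
print(df)
```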