Mastering Pandas: Data Manipulation and Analysis with Python's Pandas Library
Pandas is a popular Python library for data manipulation and analysis. Its central data structure is the DataFrame, a two-dimensional table of labeled rows and columns. Here are some commonly used operations in pandas:
- Importing Pandas:
import pandas as pd
- Creating a DataFrame:
# From a dictionary
data = {'Name': ['John', 'Emma', 'Peter'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# From a CSV file
df = pd.read_csv('data.csv')
# From a database query
import sqlite3
conn = sqlite3.connect('database.db')
query = "SELECT * FROM table_name"
df = pd.read_sql_query(query, conn)
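Putting the dictionary approach above into a minimal runnable sketch (the names and values are the same illustrative ones used throughout this article):

```python
import pandas as pd

# Build a small DataFrame from a dictionary of column lists
data = {'Name': ['John', 'Emma', 'Peter'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df.shape)  # 3 rows, 3 columns
```

Each dictionary key becomes a column name, and each list becomes that column's values; all lists must have the same length.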
- Viewing Data:
# Display the first few rows
df.head()
# Display the last few rows
df.tail()
# Display summary statistics
df.describe()
# Display column names
df.columns
# Display DataFrame shape
df.shape
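A short sketch tying the inspection methods above together on a toy single-column DataFrame (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 28, 41, 35]})
first_two = df.head(2)   # first 2 rows (head defaults to 5)
last_two = df.tail(2)    # last 2 rows
stats = df.describe()    # count, mean, std, min, quartiles, max per numeric column
print(df.shape, list(df.columns))
```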
- Data Selection and Filtering:
# Select a single column
df['Column_Name']
# Select multiple columns
df[['Column1', 'Column2']]
# Select rows based on a condition
df[df['Age'] > 25]
# Combine multiple conditions
df[(df['Age'] > 25) & (df['City'] == 'London')]
# Select rows based on a specific value in a column
df.loc[df['City'] == 'Paris']
# Select rows by index
df.loc[2:4] # Select rows with index labels 2, 3, and 4 (unlike Python slicing, .loc includes the endpoint)
# Select rows by index and specific columns
df.loc[2:4, ['Column1', 'Column2']]
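The filtering patterns above, combined into one runnable sketch using the sample data from earlier:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Age': [25, 30, 28],
                   'City': ['New York', 'London', 'Paris']})
older = df[df['Age'] > 25]                              # boolean filter
emma = df[(df['Age'] > 25) & (df['City'] == 'London')]  # combined conditions (note the parentheses)
paris = df.loc[df['City'] == 'Paris', ['Name', 'Age']]  # rows by condition, plus specific columns
```

Each condition must be wrapped in parentheses when combined with `&` (and) or `|` (or), because those operators bind more tightly than the comparisons.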
- Data Manipulation:
# Add a new column
df['New_Column'] = values # values: a list, array, or Series matching the DataFrame's length
# Remove a column
df.drop('Column_Name', axis=1, inplace=True)
# Sort DataFrame by column(s)
df.sort_values('Column_Name', ascending=False)
# Group by a column and perform aggregation
df.groupby('Column_Name').agg({'Column1': 'sum', 'Column2': 'mean'})
# Apply a function to a column or DataFrame
df['Column_Name'].apply(function)
# Rename columns
df.rename(columns={'Old_Name': 'New_Name'}, inplace=True)
# Replace values in a column
df['Column_Name'] = df['Column_Name'].replace('Old_Value', 'New_Value') # assign back; inplace=True on a column selection is unreliable
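A minimal sketch of the manipulation steps above; note that `sort_values` and `rename` return new DataFrames unless `inplace=True` is passed, so the results are assigned back:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Age': [25, 30, 28]})
df['Age_in_5_Years'] = df['Age'] + 5            # add a derived column
df = df.sort_values('Age', ascending=False)     # sort descending, returns a new frame
df = df.rename(columns={'Name': 'First_Name'})  # rename without inplace
```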
- Missing Data Handling:
# Check for missing values
df.isnull()
# Drop rows with missing values
df.dropna()
# Fill missing values with a specific value
df.fillna(value)
# Interpolate missing values
df.interpolate()
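The missing-data operations above in one runnable sketch; `interpolate` fills the gap linearly from its neighbors:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
n_missing = df['x'].isnull().sum()  # count missing values
filled = df['x'].fillna(0)          # replace NaN with a constant
interp = df['x'].interpolate()      # linear interpolation: midpoint between 1.0 and 3.0
```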
- Data Aggregation:
# Calculate the sum of a column
df['Column_Name'].sum()
# Calculate the mean of a column
df['Column_Name'].mean()
# Calculate the maximum value of a column
df['Column_Name'].max()
# Calculate the minimum value of a column
df['Column_Name'].min()
# Calculate the count of non-null values in a column
df['Column_Name'].count()
# Calculate the unique values in a column
df['Column_Name'].unique()
# Calculate the frequency of each unique value in a column
df['Column_Name'].value_counts()
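These aggregation methods work the same on a single column (a Series); a quick sketch with small illustrative numbers:

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])
total = s.sum()             # 3 + 1 + 3 + 2
avg = s.mean()
counts = s.value_counts()   # frequency per unique value
distinct = s.unique()       # unique values in order of appearance
```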
- Data Cleaning and Preprocessing:
# Remove duplicate rows
df.drop_duplicates()
# Replace missing values with the mean of the column
df['Column_Name'] = df['Column_Name'].fillna(df['Column_Name'].mean()) # assign back; inplace=True on a column selection is unreliable
# Remove leading/trailing whitespaces from strings
df['Column_Name'] = df['Column_Name'].str.strip()
# Convert a column to lowercase
df['Column_Name'] = df['Column_Name'].str.lower()
# Convert a column to datetime format
df['Column_Name'] = pd.to_datetime(df['Column_Name'])
# Convert categorical variables to numerical using one-hot encoding
df_encoded = pd.get_dummies(df['Column_Name'])
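A cleaning pipeline sketch chaining the string operations above; after normalizing, two rows become identical and `drop_duplicates` removes one:

```python
import pandas as pd

df = pd.DataFrame({'City': ['  London ', 'PARIS', 'London']})
df['City'] = df['City'].str.strip().str.lower()  # trim whitespace, then lowercase
deduped = df.drop_duplicates()                   # 'london' now appears only once
dummies = pd.get_dummies(df['City'])             # one indicator column per unique value
```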
- Data Aggregation and Grouping:
# Group by multiple columns and calculate the sum
df.groupby(['Column1', 'Column2']).sum()
# Pivot table - reorganize data based on column values
df.pivot_table(index='Column1', columns='Column2', values='Column3', aggfunc='mean')
# Resample time series data
df.resample('D').mean() # Resample to daily frequency and calculate mean
# Apply a custom aggregation function
def custom_agg(x):
    return x.max() - x.min()
df.groupby('Column1')['Column2'].agg(custom_agg)
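A runnable sketch of the custom-aggregation pattern above, computing the per-group spread (max minus min) on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                   'Value': [1, 5, 2, 8]})

def value_range(x):
    return x.max() - x.min()  # spread within each group

spread = df.groupby('Group')['Value'].agg(value_range)
pivot = df.pivot_table(index='Group', values='Value', aggfunc='mean')
```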
- Data Transformation:
# Apply a function element-wise to a column
df['Column_Name'].apply(lambda x: function(x))
# Apply a function row-wise to a DataFrame
df.apply(function, axis=1)
# Apply a function to each group in a grouped DataFrame
df.groupby('Column1').apply(function)
# Merge two DataFrames based on a common column
merged_df = pd.merge(df1, df2, on='Column_Name', how='inner')
# Concatenate DataFrames vertically
concatenated_df = pd.concat([df1, df2])
# Reshape DataFrame using melt function
melted_df = pd.melt(df, id_vars=['Column1', 'Column2'], value_vars=['Column3', 'Column4'])
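The merge, concat, and melt operations above in one sketch; an inner merge keeps only keys present in both frames:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['John', 'Emma']})
df2 = pd.DataFrame({'ID': [2, 3], 'City': ['London', 'Paris']})
merged = pd.merge(df1, df2, on='ID', how='inner')   # only ID 2 appears in both
stacked = pd.concat([df1, df1], ignore_index=True)  # vertical concat, fresh index
melted = pd.melt(df1, id_vars=['ID'], value_vars=['Name'])  # wide -> long
```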
- Data Visualization:
# Plot a line chart
df.plot(x='Column1', y='Column2', kind='line')
# Plot a bar chart
df.plot(x='Column1', y='Column2', kind='bar')
# Plot a scatter plot
df.plot(x='Column1', y='Column2', kind='scatter')
# Plot a histogram
df['Column_Name'].plot(kind='hist')
# Plot a box plot
df.boxplot(column='Column_Name')
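Pandas plotting delegates to matplotlib, so matplotlib must be installed; a minimal sketch that renders a line chart without opening a window (the `Agg` backend draws off-screen):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no display needed
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
ax = df.plot(x='x', y='y', kind='line')  # returns a matplotlib Axes object
```

The returned Axes can be customized further (titles, labels, limits) with the usual matplotlib calls before saving via `ax.figure.savefig(...)`.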
- Data Input and Output:
# Write DataFrame to a CSV file
df.to_csv('data.csv', index=False)
# Write DataFrame to an Excel file
df.to_excel('data.xlsx', index=False)
# Read data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read data from a JSON file
df = pd.read_json('data.json')
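A CSV round-trip sketch using a temporary directory, so it runs anywhere without leaving files behind; `index=False` keeps the row index out of the file so the read-back frame matches the original:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
df.to_csv(path, index=False)  # write without the index column
df_back = pd.read_csv(path)   # read it back unchanged
```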
- Handling Time Series Data:
# Set a column as the index (datetime column)
df.set_index('Datetime_Column', inplace=True)
# Resample time series data to a lower frequency
df.resample('W').sum() # Resample to weekly frequency and calculate the sum
# Shift values in a column by a specified number of periods
df['Shifted_Column'] = df['Column_Name'].shift(1)
# Rolling window calculations (e.g., rolling mean, rolling sum)
df['Rolling_Mean'] = df['Column_Name'].rolling(window=3).mean()
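The time-series operations above need a DatetimeIndex; a sketch over six consecutive days (all falling in the same calendar week, so the weekly resample produces a single bucket):

```python
import pandas as pd

idx = pd.date_range('2024-01-01', periods=6, freq='D')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)
weekly = ts.resample('W').sum()       # downsample to weekly totals
shifted = ts.shift(1)                 # lag by one period; first value becomes NaN
rolling = ts.rolling(window=3).mean() # 3-day rolling mean; first two values are NaN
```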
- Merging and Joining DataFrames:
# Concatenate DataFrames horizontally
concatenated_df = pd.concat([df1, df2], axis=1)
# Merge two DataFrames on multiple columns
merged_df = pd.merge(df1, df2, on=['Column1', 'Column2'], how='inner')
# Perform a database-style join on multiple columns
joined_df = df1.join(df2.set_index(['Column1', 'Column2']), on=['Column1', 'Column2'])
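Contrasting join types in a sketch: an inner merge keeps only shared keys, an outer merge keeps everything and fills gaps with NaN, and `concat` with `axis=1` simply places the frames side by side by index:

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['a', 'b'], 'Left': [1, 2]})
df2 = pd.DataFrame({'Key': ['b', 'c'], 'Right': [3, 4]})
inner = pd.merge(df1, df2, on='Key', how='inner')  # only 'b' is in both
outer = pd.merge(df1, df2, on='Key', how='outer')  # 'a', 'b', 'c'; missing cells become NaN
side_by_side = pd.concat([df1, df2], axis=1)       # horizontal concat, aligned on index
```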
- Handling Categorical Data:
# Convert a column to categorical data type
df['Column_Name'] = df['Column_Name'].astype('category')
# Encode categorical variables using label encoding
df['Encoded_Column'] = df['Column_Name'].cat.codes
# Perform one-hot encoding on multiple categorical columns
encoded_df = pd.get_dummies(df, columns=['Column1', 'Column2'])
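A sketch of the categorical workflow above; note that `cat.codes` numbers the categories in sorted order by default ('large' gets 0, 'small' gets 1), not in order of appearance:

```python
import pandas as pd

df = pd.DataFrame({'Size': ['small', 'large', 'small']})
df['Size'] = df['Size'].astype('category')
df['Size_Code'] = df['Size'].cat.codes             # label encoding: integers per category
encoded = pd.get_dummies(df, columns=['Size'])     # one-hot columns Size_large, Size_small
```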
- Handling Text Data:
# Convert a column to string data type
df['Column_Name'] = df['Column_Name'].astype(str)
# Perform string matching and extraction
df['Extracted_Text'] = df['Column_Name'].str.extract(r'(\d+)')
# Perform string replacement
df['Column_Name'] = df['Column_Name'].str.replace('old_text', 'new_text')
# Perform case-insensitive string matching
df[df['Column_Name'].str.contains('pattern', case=False)]
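The string extraction and matching patterns above in a sketch; `str.extract` returns the first regex capture group per row, and `case=False` makes the match case-insensitive:

```python
import pandas as pd

df = pd.DataFrame({'Item': ['Widget 42', 'gadget 7', 'WIDGET 9']})
df['Number'] = df['Item'].str.extract(r'(\d+)')              # capture the digits as strings
widgets = df[df['Item'].str.contains('widget', case=False)]  # matches 'Widget' and 'WIDGET'
```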
- Handling Missing Data:
# Check for missing values
df.isnull()
# Drop rows or columns with missing values
df.dropna() # Drops rows with any missing values
df.dropna(axis=1) # Drops columns with any missing values
# Fill missing values with forward or backward filling
df.ffill() # Forward fill (fillna(method='ffill') is deprecated in recent pandas)
df.bfill() # Backward fill
- Working with MultiIndex DataFrames:
# Create a DataFrame with a MultiIndex
df = pd.DataFrame(data, index=[['A', 'A', 'B', 'B'], [1, 2, 1, 2]], columns=['Column1', 'Column2'])
# Select data from a MultiIndex DataFrame
df.loc['A'] # Select data from the first level index 'A'
df.loc['A', 1] # Select data from the first level index 'A' and the second level index 1
# Perform operations on specific levels of the index
df.groupby(level=0).sum() # Sum values at the first index level (df.sum(level=0) was removed in pandas 2.0)
df.groupby(level=1).mean() # Calculate the mean at the second index level
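A MultiIndex sketch with the same A/B layout as above; `groupby(level=0)` aggregates over the first index level, and `.loc['A']` performs partial selection on it:

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4]},
                  index=pd.MultiIndex.from_tuples(
                      [('A', 1), ('A', 2), ('B', 1), ('B', 2)]))
level0_sum = df.groupby(level=0).sum()  # sum per first-level label: A -> 3, B -> 7
a_rows = df.loc['A']                    # all rows under first-level label 'A'
```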