
Income Analysis with Python: Pandas, Matplotlib & Seaborn Guide

Why Income Analysis Matters in 2024

Understanding how income is distributed across regions, ages, and industries is essential for businesses, policymakers, and data enthusiasts. With the rise of remote work, inflation, and gig‑economy jobs, traditional salary benchmarks are evolving fast. Analyzing income data helps you spot trends, identify gaps, and make data‑driven decisions that boost profitability and social impact.

Preparing Your Environment: Install Pandas, Matplotlib & Seaborn

Before diving into the data, set up a clean Python environment. The three libraries we’ll use are:

  • Pandas – data manipulation and cleaning.
  • Matplotlib – low‑level plotting engine.
  • Seaborn – high‑level statistical visualizations.

Run the following command in your terminal:

pip install pandas matplotlib seaborn

Once installed, import them at the top of your script:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")  # set_theme() is the current name; sns.set() is an older alias

Setting the Seaborn style ensures all charts share a clean, modern look.

Loading and Cleaning Income Data

For this tutorial we’ll use a public U.S. Census income dataset. It contains columns such as age, education, occupation, hours_per_week, and salary (binary: >50K or <=50K).

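# Load CSV
url = "https://example.com/census_income.csv"
# Assumption: as in the UCI Adult file, missing values are coded "?" and
# fields may carry a leading space; adjust both options to match your copy
df = pd.read_csv(url, na_values="?", skipinitialspace=True)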

# Quick glance
print(df.head())

# Handle missing values
df = df.dropna()

# Convert categorical columns to "category" dtype for better memory usage
categorical_cols = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
for col in categorical_cols:
    df[col] = df[col].astype('category')

Cleaning steps like dropna() and dtype conversion are crucial: they prevent hidden bugs later and speed up aggregation.
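Both cleaning steps can be checked on a small example. The sketch below uses a made-up three-row CSV and a made-up education column (neither comes from the census file) to show two things: dropna() only removes rows once placeholder strings such as "?" are marked as missing, and a low-cardinality text column takes less memory as "category" than as plain strings.

```python
import io
import pandas as pd

# Hypothetical sample: the "?" in row 2 stands in for a missing workclass
csv_text = "age,workclass,salary\n39,State-gov,<=50K\n50,?,>50K\n38,Private,<=50K\n"

plain = pd.read_csv(io.StringIO(csv_text))                   # "?" stays a normal string
marked = pd.read_csv(io.StringIO(csv_text), na_values="?")   # "?" becomes NaN

rows_plain = len(plain.dropna())    # nothing is dropped
rows_marked = len(marked.dropna())  # the "?" row is dropped
print(rows_plain, rows_marked)

# Hypothetical column with a few labels repeated many times
education = pd.Series(["Bachelors", "HS-grad", "Masters", "HS-grad"] * 2500)
bytes_as_text = education.memory_usage(deep=True)
bytes_as_category = education.astype("category").memory_usage(deep=True)
print(f"text: {bytes_as_text:,} bytes, category: {bytes_as_category:,} bytes")
```

The exact byte counts depend on your pandas version, so compare the two numbers rather than relying on specific values.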

Exploratory Data Analysis (EDA) with Pandas

Start with descriptive statistics to understand the overall income landscape.

# Summary of numeric columns
print(df.describe())

# Income distribution
income_counts = df['salary'].value_counts()
print(income_counts)
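If shares are easier to read than raw counts, value_counts() can normalize the result. A minimal sketch on a hypothetical four-row salary column:

```python
import pandas as pd

# Hypothetical labels, not taken from the census file
salary = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="salary")

# normalize=True returns fractions that sum to 1 instead of counts
shares = salary.value_counts(normalize=True)
print(shares)
```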

The output shows how the sample splits between high earners (>50K) and low earners (<=50K). Next, explore how salary varies with age.

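# salary is a text label, so build a 0/1 indicator before averaging
df['high_income'] = (df['salary'].str.strip() == '>50K').astype(int)
age_income = df.groupby('age')['high_income'].mean()
print(age_income.head())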

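This groupby operation returns the share of people earning >50K at each age, which you can read as an empirical probability. It only works on a numeric 0/1 column; averaging the raw text labels raises an error.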
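To see why the mean of a 0/1 indicator equals a share, here is a small sketch on five made-up rows. The high_income column name is an illustrative choice, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows: two people aged 25, three aged 40
toy = pd.DataFrame({
    "age": [25, 25, 40, 40, 40],
    "salary": ["<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# 1 if the person earns >50K, else 0
toy["high_income"] = (toy["salary"] == ">50K").astype(int)

# Mean of the indicator = fraction of high earners in each age group
share_by_age = toy.groupby("age")["high_income"].mean()
print(share_by_age)
```

None of the 25-year-olds and two of the three 40-year-olds earn >50K, so the shares are 0 and about 0.67.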

Visualizing Income Patterns with Matplotlib & Seaborn

Charts turn numbers into stories. Below are three essential visualizations.

1. Age vs. Income Probability

plt.figure(figsize=(10,6))
age_income.plot(kind='line', color='steelblue')
plt.title('Probability of Earning >50K by Age')
plt.xlabel('Age')
plt.ylabel('Probability')
plt.tight_layout()
plt.show()

This line chart quickly reveals that income probability rises sharply after age 30, peaks around 45, then slowly declines.
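Per-age shares can be noisy where only a few people share an exact age. One option is to group ages into bands first. The sketch below uses made-up rows and arbitrary band edges, so treat both as assumptions to tune for your data.

```python
import pandas as pd

# Hypothetical rows with a precomputed 0/1 high_income indicator
toy = pd.DataFrame({
    "age": [22, 27, 34, 38, 45, 47, 58, 63],
    "high_income": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Band edges are illustrative; pd.cut intervals are open on the left, closed on the right
bins = [17, 30, 40, 50, 65]
labels = ["18-30", "31-40", "41-50", "51-65"]
toy["age_band"] = pd.cut(toy["age"], bins=bins, labels=labels)

# observed=True keeps only the bands that actually occur
share_by_band = toy.groupby("age_band", observed=True)["high_income"].mean()
print(share_by_band)
```

The result can be plotted with share_by_band.plot(kind='bar') in place of the line chart.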

2. Hours Worked by Education Level and Income

plt.figure(figsize=(12,7))
sns.boxplot(x='education', y='hours_per_week', hue='salary', data=df, palette='Set2')
plt.title('Hours Worked vs. Education Level, Colored by Income')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The box‑plot highlights that people with a Doctorate tend to work fewer hours yet still belong to the >50K group, emphasizing the value of advanced education.

3. Heatmap of Correlation Matrix

# One-hot encode salary so it can enter the correlation matrix
# (dtype=int keeps the dummy columns numeric across pandas versions)
encoded = pd.get_dummies(df[['age','education_num','hours_per_week','salary']], dtype=int)
corr = encoded.corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', cbar=False)
plt.title('Correlation Heatmap of Key Income Features')
plt.show()

The heatmap makes it obvious that education_num and hours_per_week have a moderate positive correlation with high income, while age shows a weaker link.
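One-hot encoding a two-level salary column yields two mirror-image columns, so the heatmap repeats each correlation with the sign flipped. A single 0/1 indicator is easier to read, and a ranked list of its correlations can be pulled out directly. This sketch uses six made-up rows, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; high_income is 1 for >50K, 0 otherwise
toy = pd.DataFrame({
    "education_num": [9, 10, 13, 14, 16, 9],
    "hours_per_week": [40, 35, 45, 50, 40, 60],
    "high_income": [0, 0, 1, 1, 1, 0],
})

# Pearson correlation of each feature with the indicator, strongest first
corr_with_income = (
    toy.corr()["high_income"]
    .drop("high_income")
    .sort_values(ascending=False)
)
print(corr_with_income)
```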

Actionable Insights from the Visual Analysis

  • Targeted Upskilling: Employees with a high‑school diploma but many work hours still fall below the 50K threshold. Offering certification programs can push them into higher‑pay brackets.
  • Age‑Based Compensation Review: The dip after age 50 suggests a potential retirement‑prep window. Companies can design retention bonuses for seasoned staff.
  • Flexible Hours for Highly Educated Talent: The Doctorate box‑plot shows fewer hours are needed for high earnings, indicating that flexible‑work policies could attract top scholars.

These insights are not just academic—they can shape hiring, training, and compensation strategies.
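Before acting on a pattern read off a chart, it helps to confirm it with a direct filter. The sketch below checks the first insight on made-up rows; the 45-hour cutoff for "many work hours" is an assumption, not a value from the dataset.

```python
import pandas as pd

# Hypothetical rows, not taken from the census file
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "HS-grad", "Bachelors", "Doctorate"],
    "hours_per_week": [50, 60, 40, 45, 35],
    "salary": ["<=50K", ">50K", "<=50K", ">50K", ">50K"],
})

LONG_HOURS = 45  # assumed threshold for "many work hours"

# High-school graduates working more than the threshold
segment = toy[(toy["education"] == "HS-grad") & (toy["hours_per_week"] > LONG_HOURS)]

# Fraction of that segment at or below 50K
share_below = (segment["salary"] == "<=50K").mean()
print(f"{len(segment)} people in segment, {share_below:.0%} at or below 50K")
```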

Putting It All Together: A Reusable Python Script

Below is a compact script that loads data, cleans it, performs EDA, and saves the three charts as PNG files for reporting.

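import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

categorical_cols = ["workclass", "education", "marital_status", "occupation",
                    "relationship", "race", "sex", "native_country"]

def main(csv_path):
    # Assumption: missing values are coded "?" and fields may carry a leading space
    df = pd.read_csv(csv_path, na_values="?", skipinitialspace=True).dropna()
    # Store text columns as "category" dtype
    for col in categorical_cols:
        df[col] = df[col].astype('category')

    # ---- EDA ----
    # salary is a text label, so average a 0/1 indicator instead
    df['high_income'] = (df['salary'].str.strip() == '>50K').astype(int)
    age_income = df.groupby('age')['high_income'].mean()

    # ---- Plot 1 ----
    plt.figure(figsize=(10,6))
    age_income.plot(color='steelblue')
    plt.title('Probability of Earning >50K by Age')
    plt.xlabel('Age')
    plt.ylabel('Probability')
    plt.tight_layout()
    plt.savefig('age_income.png')
    plt.close()  # release the figure before drawing the next one

    # ---- Plot 2 ----
    plt.figure(figsize=(12,7))
    sns.boxplot(x='education', y='hours_per_week', hue='salary', data=df, palette='Set2')
    plt.title('Hours Worked vs. Education Level')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('education_hours.png')
    plt.close()

    # ---- Plot 3 ----
    encoded = pd.get_dummies(df[['age','education_num','hours_per_week','salary']], dtype=int)
    corr = encoded.corr()
    plt.figure(figsize=(8,6))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', cbar=False)
    plt.title('Correlation Heatmap')
    plt.tight_layout()
    plt.savefig('corr_heatmap.png')
    plt.close()

    print('All charts saved!')

if __name__ == "__main__":
    main('census_income.csv')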

Running this script produces ready‑to‑embed visuals for presentations, blogs, or internal dashboards.

Conclusion & Next Steps

By combining Pandas for data wrangling, Matplotlib for custom plots, and Seaborn for statistical visualizations, you can uncover hidden income patterns quickly and accurately. Start with the script above, adapt it to your own datasets, and experiment with additional features like geographic mapping or time‑series salary trends.

Ready to turn raw salary tables into actionable strategy? Contact us today for a personalized data‑science consulting session and accelerate your decision‑making.
