Python Modules and the Data Science Stack

Jeremy R. Manning

PSYC 81.09: Storytelling with Data

How we approach tools in this course

In this course, we focus on understanding WHAT these tools do and WHEN to use them -- not memorizing syntax. AI handles the syntax; you handle the thinking.

Your job is to:

  1. Know which tool solves which problem
  2. Describe what you want in plain language
  3. Verify the generated code does what you expect

What is a Python module?

  • A Python module is a file (or collection of files) of Python code that provides functions, variables, and data you can import into your workspace.
  • Modules extend the Python standard library -- they are Python's "apps."
  • AKA: library, package, toolbox, toolkit (strictly, a package is a collection of modules, but in practice the terms are used interchangeably)

Install with pip install <module_name> in Terminal, or !pip install <module_name> in Colab.
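Once installed, a module becomes available via import. A minimal sketch using NumPy (the np alias is a widely used convention, not a requirement):

```python
# after installing with pip, bring the module into your workspace
import numpy as np  # "np" is the conventional alias

# confirm the module loaded and check which version you have
print(np.__version__)
```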

The Python data science stack

Key libraries you will encounter

  Library               What it does                          When to reach for it
  --------------------  ------------------------------------  --------------------------------------
  NumPy                 Fast numerical arrays and math        Crunching numbers, linear algebra
  Pandas                Tabular data (spreadsheets)           Loading CSVs, filtering rows, grouping
  Matplotlib / Seaborn  Plotting and visualization            Any time you make a figure
  Scikit-learn          Machine learning                      Classification, clustering, regression
  HyperTools            High-dimensional data visualization   Exploring complex datasets

You do not need to memorize their APIs. You need to know which one to ask for.

What is NumPy?

  • The foundation of nearly every data science tool in Python.
  • Introduces the array object: an n-dimensional table of numbers (vectors, matrices, tensors).
  • Provides vectorized operations -- math applied to entire arrays at once, without writing loops.

When to reach for NumPy

  • You need fast numerical operations on arrays or matrices
  • You are working with large datasets where Python lists would be too slow
  • You need linear algebra, random number generation, or statistical summaries
  • Another library (Pandas, Scikit-learn, etc.) returns or expects a NumPy array

If your data is tabular with mixed types (strings, dates, numbers), reach for Pandas instead.
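A quick sketch of why mixed-type tables fit Pandas better than NumPy (the example data is made up for illustration):

```python
import numpy as np
import pandas as pd

# a NumPy array holds ONE data type; mixing types silently converts
# everything to strings
mixed = np.array([1, "apple", 3.5])
print(mixed.dtype)  # a string dtype -- the numbers are now text

# a Pandas DataFrame keeps a separate type per column
df = pd.DataFrame({"count": [1, 2],
                   "fruit": ["apple", "pear"],
                   "price": [3.50, 1.20]})
print(df.dtypes)  # count is integer, fruit is object, price is float
```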

The core idea: vectorized operations

NumPy replaces slow Python loops with fast, readable one-liners:


import numpy as np

# instead of this...
result = []
for i in range(1000):
    result.append(i ** 2)

# ...write this
result = np.arange(1000) ** 2

The second version is shorter, faster, and easier to read.
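You can check the speed claim yourself. A rough timing sketch (exact numbers depend on your machine):

```python
import time
import numpy as np

n = 1_000_000

# pure-Python loop
t0 = time.perf_counter()
loop_result = [i ** 2 for i in range(n)]
loop_time = time.perf_counter() - t0

# vectorized NumPy version
t0 = time.perf_counter()
vec_result = np.arange(n) ** 2
vec_time = time.perf_counter() - t0

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
# both produce the same squares, but the vectorized version is
# typically an order of magnitude (or more) faster
```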

Vibe coding: from idea to working code

  1. Describe the analysis you want in plain English
  2. Generate code using an AI assistant (e.g., Claude Code)
  3. Run the code and inspect the output
  4. Verify and explain -- make sure you understand every section

This is how modern data scientists work. The skill is in knowing what to ask for and whether the result is correct.

Demo: describing a numerical task

Suppose you want to analyze how correlated different variables are in a dataset. You might prompt:

"Load the Iris dataset from scikit-learn. Compute the correlation matrix of the four numeric features using NumPy. Then plot it as a heatmap with Seaborn, labeling axes with feature names."

Notice: the prompt names specific tools (NumPy, Seaborn, scikit-learn) and describes the goal, not the syntax.

Demo: generated code


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
corr = np.corrcoef(iris.data, rowvar=False)

plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f",
            xticklabels=iris.feature_names,
            yticklabels=iris.feature_names,
            cmap="coolwarm")
plt.title("Iris Feature Correlations")
plt.tight_layout()
plt.show()

AI generated this in seconds. Your job is to understand what it does.

Verify and explain

Walk through the generated code and answer:

  1. Where does the data come from? (load_iris() -- a built-in scikit-learn dataset)
  2. What does np.corrcoef compute? (Pearson correlation coefficients between columns)
  3. Why rowvar=False? (Tells NumPy that columns are variables, rows are observations)
  4. What does the heatmap show? (Which features move together vs. independently)

If you cannot answer these questions, you are not ready to move on.
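One way to build that confidence is to test np.corrcoef on data where you already know the answer. A small sketch with made-up data:

```python
import numpy as np

# two columns where y is exactly 2 * x, so their correlation must be 1.0
data = np.array([[1.0, 2.0],
                 [2.0, 4.0],
                 [3.0, 6.0],
                 [4.0, 8.0]])

# rowvar=False: columns are variables, rows are observations
corr = np.corrcoef(data, rowvar=False)
print(corr.shape)  # (2, 2) -- one entry per pair of variables
print(corr[0, 1])  # 1.0, as expected for perfectly correlated columns

# with rowvar=True (the default), NumPy would instead treat each ROW as
# a variable and return a 4 x 4 matrix -- an easy mistake to catch this way
```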

Try it yourself

Open a Colab notebook and use an AI assistant to generate code for the following task:

"Create a 1000-element array of random numbers drawn from a normal distribution. Compute the mean and standard deviation. Then plot a histogram with 30 bins and overlay a vertical line at the mean."

After the code runs, explain each line to a partner or in a markdown cell.
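One possible solution sketch (seeded so the numbers are reproducible; your AI-generated version may look different):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)  # seed makes the "random" draw repeatable
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

mean = samples.mean()
std = samples.std()
print(f"mean: {mean:.3f}, std: {std:.3f}")  # should be near 0 and 1

plt.hist(samples, bins=30)
plt.axvline(mean, color="red", linestyle="--", label=f"mean = {mean:.2f}")
plt.legend()
plt.show()
```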

Building your toolkit intuition

Ask yourself:

  • Is my data a table with column names? --> Pandas
  • Do I need fast math on arrays of numbers? --> NumPy
  • Am I fitting a model or classifier? --> Scikit-learn
  • Do I need a plot? --> Matplotlib or Seaborn
  • Am I exploring high-dimensional structure? --> HyperTools

When in doubt, describe your goal to an AI assistant and let it pick the library.

Before you move on

Read the generated code and explain, in your own words, what each section does and why it is there.

  • If a line uses a function you have never seen, look up what it returns.
  • If you cannot explain why a step is there, you do not yet understand the analysis.
  • Understanding beats memorization. The AI can write the code -- only you can judge whether it answers your question.

Questions? Want to chat more?

📧 Email me
💬 Join our Slack
💁 Come to office hours
  • Check the course schedule for what's coming next