Pandas Cheat Sheet

If you are new to Pandas feel free to read Introduction to Pandas

I’ve assembled some pandas code snippets

Reading Data

Reading CSV

import pandas as pd

# read from csv
df = pd.read_csv("path_to_file")

Can also be textfiles. file suffix is ignored

The default limiter for comma separated value files is the comma. If you have data with another delimiter you can specify it via:

delimiter=";"

If your data has no header you can pass header=None into the function

df = pd.read_csv("./aoc_day_01_data.txt", header=None)

With skiprows you can start reading in at any row

skiprows=8

Sometimes you need to alter the encoding as well:

encoding="cp1252"

Reading Excel

You can read excel files as well but you need to install

pip install openpyxl

df = pd.read_excel("./my_excel_sheet.xlsx")

With sheet_name you can select the individual sheet:

sheet_name="my_sheet_1"

Inspecting data

Basic information

df.describe()

Length

len(df)

showing entries

df.head()

df.tail(10)

Indexing

df['A']

gives you column A

iloc gives you entries based on numerical index

#      [row, column]
df.iloc[0,   0]

#     [row, column]
df.loc[:, :]

Data Cleaning

Dropping columns

del df["column_name"]

Renaming columns

df.columns = ["new_column_name", ...]

Comparing columns

df['increased'] = (df['shifted'] > df['original'])

Shifting columns

df['shifted'] = df['original'].shift(-1)

Splitting

Splitting strings into individual columns

df = pd.DataFrame(df["original"].str.split('').tolist())

Counting and Calculating

Summing columns

df["value"].sum()

Cumulative sum

df["aim"].cumsum()

Rolling sum

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
df["rolling_sum"] = df.original.rolling(window=indexer).sum()

Counting value occurence

df['increased'].value_counts(dropna=False)

Counting occurrences for all columns

df = pd.concat([df[column].value_counts() for column in df], axis = 1)

Convert column to datetime

df.loc[:, 'date'] = pd.to_datetime(df.loc[:, 'date'])

Convert datetime to minutes since midnight

df_train.loc[:, 'msm'] = df_train.loc[:, "date"].dt.hour * 60 + df_train.loc[:, "date"].dt.minute