If you are new to Pandas feel free to read Introduction to Pandas
I’ve assembled some pandas code snippets
Table of Contents
Reading Data
Reading CSV
import pandas as pd # read from csv df = pd.read_csv("path_to_file")
Can also be textfiles. file suffix is ignored
The default limiter for comma separated value files is the comma. If you have data with another delimiter you can specify it via:
delimiter=";"
If your data has no header you can pass header=None into the function
df = pd.read_csv("./aoc_day_01_data.txt", header=None)
With skiprows you can start reading in at any row
skiprows=8
Sometimes you need to alter the encoding as well:
encoding="cp1252"
Reading Excel
You can read excel files as well but you need to install
pip install openpyxl
df = pd.read_excel("./my_excel_sheet.xlsx")
With sheet_name you can select the individual sheet:
sheet_name="my_sheet_1"
Inspecting data
Basic information
df.describe()
Length
len(df)
showing entries
df.head()
df.tail(10)
Indexing
df['A']
gives you column A
iloc gives you entries based on numerical index
# [row, column] df.iloc[0, 0]
# [row, column] df.loc[:, :]
Data Cleaning
Dropping columns
del df["column_name"]
Renaming columns
df.columns = ["new_column_name", ...]
Comparing columns
df['increased'] = (df['shifted'] > df['original'])
Shifting columns
df['shifted'] = df['original'].shift(-1)
Splitting
Splitting strings into individual columns
df = pd.DataFrame(df["original"].str.split('').tolist())
Counting and Calculating
Summing columns
df["value"].sum()
Cumulative sum
df["aim"].cumsum()
Rolling sum
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3) df["rolling_sum"] = df.original.rolling(window=indexer).sum()
Counting value occurence
df['increased'].value_counts(dropna=False)
Counting occurrences for all columns
df = pd.concat([df[column].value_counts() for column in df], axis = 1)
Convert column to datetime
df.loc[:, 'date'] = pd.to_datetime(df.loc[:, 'date'])
Convert datetime to minutes since midnight
df_train.loc[:, 'msm'] = df_train.loc[:, "date"].dt.hour * 60 + df_train.loc[:, "date"].dt.minute