An intro to Polars

This introduction to Polars is my attempt to make it easy for future me to recall what I learnt during the Polars training sessions provided by Quansight.

Right, let’s get to it. First set up a virtual environment. Then go through the examples below.

Set up a virtual environment

uv venv --seed;
source .venv/bin/activate;
python -m pip install uv;
python -m uv pip install jupyter pandas polars pyarrow;

Create a DataFrame

Basic

import polars as pl

pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})

Specify column types

pl.DataFrame(
    {'key': ['A', 'B'], 'value': [1, 2]},
    schema_overrides={
        'key': pl.String,
        'value': pl.Int16,
    }
)

Reading from a file eagerly

pl.read_parquet('dataset.parquet')

Reading from a file lazily

df = pl.scan_parquet('dataset.parquet')
df.collect()

Working with DataFrame columns

Select columns

import polars.selectors as cs

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.select('key')
df.select(cs.all() - cs.numeric())

Add/replace columns

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_columns(
    keyvalue=pl.col('key') + pl.col('value').cast(pl.String)
)

Drop columns

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.drop('value')

Working with DataFrame rows

Select rows

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.filter(pl.col('value') > 1)

Select every Nth row

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df[::2]             # Gather every 2nd row
df.gather_every(2)  # Gather every 2nd row

Adding a row index column similar to what is used for pandas DataFrames

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_row_index()

Non-primitive data type columns

Lists

Creating lists from values

pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})

Creating lists using values from other columns

df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9]})
df.with_columns(x_y=pl.concat_list('x', 'y'))

Processing list values

df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})
df.with_columns(z_mean=pl.col('z').list.mean())

Structs

Creating DataFrames with struct columns

df = pl.DataFrame({'cars': [{'make': 'Audi', 'year': 2020}, {'make': 'Volkswagen', 'year': 2024}]})

Unnesting a struct column

df.unnest('cars')

Selecting a field of a struct column

df.select(pl.col('cars').struct.field('make'))

Viewing the schema of a DataFrame containing struct columns

df.schema

Arrays

pl.Array represents a fixed-size collection of values: every row holds the same number of elements. In contrast, pl.List represents a variable-size collection of values.

pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary'],
            ['John', 'Jane'],
        ],
    },
    schema={
        'friends': pl.Array(pl.String, 2),
    },
)

Aggregations

Mean of the values in a column

df = pl.scan_parquet("../titanic.parquet")
df.select('survived').mean().collect()

Mean of the values in a column grouped by values in another column

df = pl.scan_parquet("../titanic.parquet")
df.group_by('class').agg(pl.col('survived').mean()).collect()

Mean of the values in a column grouped by values in another column and joined back into the initial DataFrame

df = pl.scan_parquet("../titanic.parquet")
df.select(
    'class',
    'survived',
    class_mean_survival = pl.col('survived').mean().over('class')
).collect()
