An intro to Polars

This introduction to Polars is my attempt to make it easy for future me to recollect what I have learnt during Polars training sessions provided by Quansight.

Right, let’s get to it. First set up a virtual environment. Then go through the examples below.

Set up a virtual environment

uv venv --seed;
source .venv/bin/activate;
python -m pip install uv;
python -m uv pip install jupyter pandas polars pyarrow;

Create a DataFrame

Basic

import polars as pl

pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})

Specify column types

pl.DataFrame(
    {'key': ['A', 'B'], 'value': [1, 2]},
    schema_overrides={
        'key': pl.String,
        'value': pl.Int16,
    }
)

Reading from a file eagerly

pl.read_parquet('dataset.parquet')

Reading from a file lazily

df = pl.scan_parquet('dataset.parquet')
df.collect()

Working with DataFrame columns

Select columns

import polars.selectors as cs

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.select('key')
df.select(cs.all() - cs.numeric())

Add/replace columns

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_columns(
    keyvalue=pl.col('key') + pl.col('value').cast(pl.String)
)

Drop columns

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.drop('value')

Working with DataFrame rows

Select rows

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.filter(pl.col('value') > 1)

Select every Nth row

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df[::2]             # Gather every 2nd row
df.gather_every(2)  # Gather every 2nd row

Adding a row index column, similar to the index of a pandas DataFrame

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_row_index()

Non-primitive data type columns

Lists

Creating lists from values

pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})

Creating lists using values from other columns

df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9]})
df.with_columns(x_y=pl.concat_list('x', 'y'))

Processing list values

df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})
df.with_columns(z_mean=pl.col('z').list.mean())

Structs

Creating DataFrames with struct columns

df = pl.DataFrame({'cars': [{'make': 'Audi', 'year': 2020}, {'make': 'Volkswagen', 'year': 2024}]})

Unnesting a struct column

df.unnest('cars')

Selecting a field of a struct column

df.select(pl.col('cars').struct.field('make'))

Viewing the schema of a DataFrame containing struct columns

df.schema

Arrays

pl.Array represents a fixed-size collection of values; pl.List, by contrast, represents a variable-size collection of values.

pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary'],
            ['John', 'Jane'],
        ],
    },
    schema={
        'friends': pl.Array(pl.String, 2),
    },
)

Aggregations

Mean of the values in a column

df = pl.scan_parquet("../titanic.parquet")
df.select('survived').mean().collect()

Mean of the values in a column grouped by values in another column

df = pl.scan_parquet("../titanic.parquet")
df.group_by('class').agg(pl.col('survived').mean()).collect()

Mean of the values in a column grouped by values in another column and broadcast back onto the rows of the initial DataFrame (a window expression)

df = pl.scan_parquet("../titanic.parquet")
df.select(
    'class',
    'survived',
    class_mean_survival=pl.col('survived').mean().over('class')
).collect()

Handling missing/invalid values

Null vs NaN in Polars

In Polars there is:

null, a missing value, which can appear in a column of any data type
NaN (Not a Number), an invalid floating-point value, which can only appear in Float32 and Float64 columns

null is detected with is_null() and counted by null_count(), whereas NaN is an ordinary float value detected with is_nan().

Counting values when some are missing/invalid

df = pl.scan_parquet("../titanic.parquet")
df.group_by('deck').len().collect()

Dropping missing/invalid values

df = pl.scan_parquet("../titanic.parquet")
df.drop_nulls().collect()
df.filter(pl.col('deck').is_not_null()).collect()

Working with multiple DataFrames

Joining DataFrames

df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
df1.join(df2, on='x')

Concatenating DataFrames (vertically)

df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
pl.concat([df1, df2], how='diagonal')

Categorical data

# Use a StringCache for the code block below in order to map strings to the same uints when
# creating df1, df2 and pl.concat([df1, df2])
#
# Alternatively, use pl.enable_string_cache() to enable the global string cache if you do not
# have a large number of strings.
with pl.StringCache():
    df1 = pl.DataFrame(
        {"a": ["blue", "blue", "green"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df2 = pl.DataFrame(
        {"a": ["green", "green", "blue"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df1.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    df2.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    pl.concat([df1, df2])

Restricting values to Enum values

s = pl.Series(["flower", "tree", "flower"], dtype=pl.Enum(["flower", "tree", "bonsai"]))
s.dtype

Using LazyFrames instead of (eager) DataFrames

Textual representation of query plan

import polars as pl

df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.explain())

Digraph representation of query plan

df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.show_graph()

Executing row-wise operations

Some row-wise operations in Polars go through nested data types: horizontal functions such as pl.cum_sum_horizontal return a struct column, which can then be unnested back into regular columns.

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select(pl.cum_sum_horizontal(pl.all())).unnest('cum_sum')

Streaming

df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.with_columns(c=pl.col('a') + pl.col('b')).collect(engine='streaming')
