An intro to Polars
This introduction to Polars is my attempt to make it easy for future me to recollect what I have learnt during Polars training sessions provided by Quansight.
- Set up a virtual environment
- Create a DataFrame
- Working with DataFrame columns
- Working with DataFrame rows
- Non-primitive data type columns
- Aggregations
- Handling missing/invalid values
- Working with multiple DataFrames
- Categorical data
- Using LazyFrames instead of (eager) DataFrames
- Executing row-wise operations
- Streaming
- Miscellaneous
Right, let’s get to it. First set up a virtual environment. Then go through the examples below.
Set up a virtual environment
uv venv --seed;
source .venv/bin/activate;
python -m pip install uv;
python -m uv pip install jupyter pandas polars pyarrow;
Create a DataFrame
Basic
import polars as pl
pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
Specify column types
pl.DataFrame(
    {'key': ['A', 'B'], 'value': [1, 2]},
    schema_overrides={
        'key': pl.String,
        'value': pl.Int16,
    },
)
Reading from a file eagerly
pl.read_parquet('dataset.parquet')
Reading from a file lazily
df = pl.scan_parquet('dataset.parquet')
df.collect()
Working with DataFrame columns
Select columns
import polars.selectors as cs
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.select('key')
df.select(cs.all() - cs.numeric())
Add/replace columns
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_columns(
    keyvalue=pl.col('key') + pl.col('value').cast(pl.String)
)
Drop columns
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.drop('value')
Working with DataFrame rows
Select rows
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.filter(pl.col('value') > 1)
Select every Nth row
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df[::2] # Gather every 2nd row
df.gather_every(2) # Gather every 2nd row
Adding a row index column (similar to the index of a pandas DataFrame)
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_row_index()
Non-primitive data type columns
Lists
Creating lists from values
pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})
Creating lists using values from other columns
df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9]})
df.with_columns(x_y=pl.concat_list('x', 'y'))
Processing list values
df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})
df.with_columns(z_mean=pl.col('z').list.mean())
Structs
Creating DataFrames with struct columns
df = pl.DataFrame({'cars': [{'make': 'Audi', 'year': 2020}, {'make': 'Volkswagen', 'year': 2024}]})
Unnesting a struct column
df.unnest('cars')
Selecting a field of a struct column
df.select(pl.col('cars').struct.field('make'))
Viewing the schema of a DataFrame containing struct columns
df.schema
Arrays
pl.Array is used to represent a fixed-size collection of values, whereas pl.List represents a variable-size collection of values. A contrasting pl.List example follows the pl.Array example below.
pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary'],
            ['John', 'Jane'],
        ],
    },
    schema={
        'friends': pl.Array(pl.String, 2),
    },
)
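For contrast, a sketch of the same column as a variable-size pl.List, where rows may differ in length:
pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary', 'Mike'],
            ['John'],
        ],
    },
    schema={
        'friends': pl.List(pl.String),
    },
)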
Aggregations
Mean of the values in a column
df = pl.scan_parquet("../titanic.parquet")
df.select('survived').mean().collect()
Mean of the values in a column grouped by values in another column
df = pl.scan_parquet("../titanic.parquet")
df.group_by('class').agg(pl.col('survived').mean()).collect()
Mean of the values in a column grouped by values in another column and broadcast back onto the rows of the initial DataFrame (a window function, via .over())
df = pl.scan_parquet("../titanic.parquet")
df.select(
    'class',
    'survived',
    class_mean_survival=pl.col('survived').mean().over('class'),
).collect()
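For reference, .over() behaves like this more explicit group_by/agg followed by a join (a sketch of the equivalent logic, not necessarily how Polars executes it internally):
df = pl.scan_parquet("../titanic.parquet")
class_means = df.group_by('class').agg(
    class_mean_survival=pl.col('survived').mean()
)
df.select('class', 'survived').join(class_means, on='class').collect()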
Handling missing/invalid values
Null vs NaN in Polars
In Polars there are two distinct concepts:
- null: missing data (of any data type).
- NaN: a floating-point value, which results from e.g. 0/0.
The sketch below shows how the two behave differently.
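A minimal sketch of the difference (the column name x is made up):
df = pl.DataFrame({'x': [1.0, None, float('nan')]})
df.with_columns(
    x_is_null=pl.col('x').is_null(),  # True only for the missing (None) value
    x_is_nan=pl.col('x').is_nan(),    # True only for the NaN value; null propagates as null
)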
Counting values when some are missing/invalid
df = pl.scan_parquet("../titanic.parquet")
df.group_by('deck').len().collect()
Dropping missing/invalid values
df = pl.scan_parquet("../titanic.parquet")
df.drop_nulls().collect()
df.filter(pl.col('deck').is_not_null()).collect()
Working with multiple DataFrames
Joining DataFrames
df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
df1.join(df2, on='x')
Concatenating DataFrames (vertically)
df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
pl.concat([df1, df2], how='diagonal')  # 'diagonal' unions the columns and fills missing values with nulls
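For reference, when the schemas match, the default how='vertical' suffices; a minimal sketch:
df1 = pl.DataFrame({'x': [0.2, 1.3]})
df2 = pl.DataFrame({'x': [9.1]})
pl.concat([df1, df2])  # how='vertical' is the default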
Categorical data
# Use a StringCache for the code block below in order to map strings to the same uints when
# creating df1, df2 and pl.concat([df1, df2])
#
# Alternatively, use pl.enable_string_cache() to enable the global string cache if you do not
# have a large number of strings.
with pl.StringCache():
    df1 = pl.DataFrame(
        {"a": ["blue", "blue", "green"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df2 = pl.DataFrame(
        {"a": ["green", "green", "blue"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df1.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    df2.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    pl.concat([df1, df2])
Restricting values to Enum values
s = pl.Series(["flower", "tree", "flower"], dtype=pl.Enum(["flower", "tree", "bonsai"]))
s.dtype
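Values outside the Enum categories are rejected at construction time; a small sketch (the exact exception type may differ between Polars versions):
try:
    pl.Series(["flower", "cactus"], dtype=pl.Enum(["flower", "tree", "bonsai"]))
except Exception as exc:  # 'cactus' is not one of the Enum categories
    print(exc)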
Using LazyFrames instead of (eager) DataFrames
- Using LazyFrames enables lazy evaluation, which lets Polars apply query optimizations automatically instead of requiring hand-rolled optimizations (see the sketch after this list).
- If you use expressions, lazy mode should be equivalent to eager mode (apart from the need to call .collect() at the end).
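As a sketch of the first point: a filter written after a select is still pushed down by the optimizer. This reuses dataset.parquet and the key/value columns from the earlier examples.
lf = pl.scan_parquet('dataset.parquet')
query = lf.select('key', 'value').filter(pl.col('value') > 1)
# The optimized plan should show the predicate applied at the scan
# rather than as a separate step after the projection.
print(query.explain())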
Textual representation of query plan
import polars as pl
df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.explain())
Digraph representation of query plan
df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.show_graph()
Executing row-wise operations
To execute some row-wise operations in Polars you might need to use nested data types.
df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select(pl.cum_sum_horizontal(pl.all())).unnest('cum_sum')
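Some horizontal operations have dedicated helpers that avoid the nested type and the unnest step; a sketch using pl.sum_horizontal and pl.max_horizontal:
df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.with_columns(
    row_sum=pl.sum_horizontal('a', 'b'),
    row_max=pl.max_horizontal('a', 'b'),
)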
Streaming
df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.with_columns(c=pl.col('a') + pl.col('b')).collect(engine='streaming')
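Streaming pairs naturally with the sink_* methods, which write results to disk without materializing the full output in memory; a minimal sketch (out.parquet is a hypothetical output path):
lf = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
lf.with_columns(c=pl.col('a') + pl.col('b')).sink_parquet('out.parquet')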
Miscellaneous
- Data is stored in a columnar (Arrow) format when using Polars.
- In Polars objects are usually immutable.
- To improve execution time, prefer a non-eval approach first, eval second, map_batches third, and map_elements fourth (the further down the list, the more jumping between Python and the Rust binary).
- Use the struct column type if the format/structure of a column is fixed; otherwise use the object type.
- pl.Series()._get_buffers() -> underlying representation.
- Use .collect(engine='streaming') or .collect(engine='gpu') to use the streaming engine and/or GPUs when collecting results from a LazyFrame (older Polars versions used the [new_]streaming=True keyword instead).
- Using sorted data enables Polars to use some optimizations which reduce execution time. Use set_sorted to tell Polars that data is sorted (Polars won't check, so use with care); a small sketch follows below.
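A minimal set_sorted sketch (the column name a is made up):
df = pl.DataFrame({'a': [1, 2, 3]})
df.with_columns(pl.col('a').set_sorted())  # only sets the sorted flag; Polars does not verify the data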