An intro to Polars

This introduction to Polars is my attempt to make it easy for future me to recollect what I have learnt during Polars training sessions provided by Quansight.

Right, let’s get to it. First set up a virtual environment. Then go through the examples below.

Set up a virtual environment

uv venv --seed;
source .venv/bin/activate;
python -m pip install uv;
python -m uv pip install jupyter pandas polars pyarrow;

Create a DataFrame

Basic

import polars as pl

pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})

Specify column types

pl.DataFrame(
    {'key': ['A', 'B'], 'value': [1, 2]},
    schema_overrides={
        'key': pl.String,
        'value': pl.Int16,
    }
)

Reading from a file eagerly

pl.read_parquet('dataset.parquet')

Reading from a file lazily

df = pl.scan_parquet('dataset.parquet')
df.collect()
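
A benefit of scanning lazily is that Polars optimizes the whole query before reading any data, e.g. by pushing filters down into the Parquet read. A minimal sketch, assuming dataset.parquet has a column named value (a hypothetical name):

lf = pl.scan_parquet('dataset.parquet')
query = lf.filter(pl.col('value') > 1)  # 'value' is a hypothetical column
print(query.explain())  # the optimized plan shows the filter applied during the scan
query.collect()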

Working with DataFrame columns

Select columns

import polars.selectors as cs

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.select('key')
df.select(cs.all() - cs.numeric())

Add/replace columns

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_columns(
    keyvalue=pl.col('key') + pl.col('value').cast(pl.String)
)

Drop columns

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.drop('value')

Working with DataFrame rows

Select rows

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.filter(pl.col('value') > 1)

Select every N rows

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df[::2]             # Gather every 2nd row
df.gather_every(2)  # Gather every 2nd row

Adding a row index column, similar to the index of a pandas DataFrame

df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_row_index()

Non-primitive data type columns

Lists

Creating lists from values

pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})

Creating lists using values from other columns

df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9]})
df.with_columns(x_y=pl.concat_list('x', 'y'))

Processing list values

df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})
df.with_columns(z_mean=pl.col('z').list.mean())

Structs

Creating DataFrames with struct columns

df = pl.DataFrame({'cars': [{'make': 'Audi', 'year': 2020}, {'make': 'Volkswagen', 'year': 2024}]})

Unnesting a struct column

df.unnest('cars')

Selecting a field of a struct column

df.select(pl.col('cars').struct.field('make'))

Viewing the schema of a DataFrame containing struct columns

df.schema

Arrays

pl.Array is used to represent a fixed-size collection of values, whereas pl.List is used to represent a variable-size collection of values.

pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary'],
            ['John', 'Jane'],
        ],
    },
    schema={
        'friends': pl.Array(pl.String, 2),
    },
)
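
For comparison, a minimal sketch of the same column as a variable-size pl.List, where the inner lists need not all have the same length:

pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary', 'Mia'],
            ['John'],
        ],
    },
    schema={
        'friends': pl.List(pl.String),
    },
)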

Aggregations

Mean of the values in a column

df = pl.scan_parquet("../titanic.parquet")
df.select('survived').mean().collect()

Mean of the values in a column grouped by values in another column

df = pl.scan_parquet("../titanic.parquet")
df.group_by('class').agg(pl.col('survived').mean()).collect()

Mean of the values in a column grouped by values in another column and broadcast back onto the rows of the original DataFrame (a window function)

df = pl.scan_parquet("../titanic.parquet")
df.select(
    'class',
    'survived',
    class_mean_survival=pl.col('survived').mean().over('class')
).collect()

Handling missing/invalid values

Null vs NaN in Polars

In Polars there is null, which marks a missing value and can appear in columns of any data type, and NaN ('not a number'), which is a valid floating-point value: it only appears in Float columns and is not treated as missing.
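
A minimal sketch of the difference (the column name x is just for illustration):

df = pl.DataFrame({'x': [1.0, float('nan'), None]})
df.select(
    is_null=pl.col('x').is_null(),  # [false, false, true]
    is_nan=pl.col('x').is_nan(),    # [false, true, null]
)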

Counting values when some are missing/invalid

df = pl.scan_parquet("../titanic.parquet")
df.group_by('deck').len().collect()

Dropping missing/invalid values

df = pl.scan_parquet("../titanic.parquet")
df.drop_nulls().collect()
df.filter(pl.col('deck').is_not_null()).collect()

Working with multiple DataFrames

Joining DataFrames

df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
df1.join(df2, on='x')
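
join() performs an inner join by default; other strategies are available via the how parameter. For example, a left join keeps every row of df1:

df1.join(df2, on='x', how='left')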

Concatenating DataFrames (vertically)

df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
pl.concat([df1, df2], how='diagonal')

Categorical data

# Use a StringCache for the code block below so that equal strings map to the same
# underlying integers when creating df1, df2 and pl.concat([df1, df2])
#
# Alternatively, use pl.enable_string_cache() to enable the global string cache if you do not
# have a large number of strings.
with pl.StringCache():
    df1 = pl.DataFrame(
        {"a": ["blue", "blue", "green"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df2 = pl.DataFrame(
        {"a": ["green", "green", "blue"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df1.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    df2.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    pl.concat([df1, df2])

Restricting values to Enum values

s = pl.Series(["flower", "tree", "flower"], dtype=pl.Enum(["flower", "tree", "bonsai"]))
s.dtype
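
Constructing a Series with a value outside the Enum's categories raises an error. A minimal sketch (the exact exception type may vary between Polars versions, hence the broad except):

try:
    pl.Series(['cactus'], dtype=pl.Enum(['flower', 'tree', 'bonsai']))
except Exception as exc:
    print(exc)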

Using LazyFrames instead of (eager) DataFrames

Textual representation of query plan

import polars as pl

df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.explain())

Digraph representation of query plan

df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.show_graph()

Executing row-wise operations

Executing some row-wise operations in Polars may require working with nested data types: the result of a horizontal operation comes back as a struct column, which can then be unnested.

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select(pl.cum_sum_horizontal(pl.all())).unnest('cum_sum')

Streaming

df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.with_columns(c=pl.col('a') + pl.col('b')).collect(engine='streaming')

Working with timeseries

Creating datetime Series

from datetime import date

pl.date_range(start=date(2025, 1, 1), end=date(2025, 1, 31), interval='1d', eager=True)
pl.Series(['2025-01-01T01:43', '2025-01-03T18:44']).str.to_datetime()
pl.Series(["2025 January 01 01:43", "2025 January 03 18:44"]).str.to_datetime('%Y %B %d %H:%M')

Filtering based on datetimes

df = pl.DataFrame({
  'now': pl.date_range(start=date(2025, 1, 1), end=date(2025, 1, 31), interval='1d', eager=True)
})
df.filter(pl.col('now') > pl.date(2025, 1, 10))
df.filter(pl.col('now').dt.day() == 10)

Datetime difference between consecutive rows

df = pl.DataFrame({
  'now': pl.date_range(start=date(2025, 1, 1), end=date(2025, 1, 31), interval='1d', eager=True)
})
df['now'].diff()

Handling time zones

# Time stamp and time zone (if set) are stored separately
ser = pl.Series(['2025-01-01T01:43', '2025-01-03T18:44']).str.to_datetime()                           # time zone unaware
ser = pl.Series(['2025-01-01T01:43', '2025-01-03T18:44']).str.to_datetime(time_zone='Europe/London')  # time zone aware
# Change time zone w/o changing underlying timestamp
ser = ser.dt.convert_time_zone('Asia/Kathmandu')
# Change timestamp ignoring current time zone
ser = ser.dt.replace_time_zone('Asia/Kathmandu')
# Unset time zone
ser = ser.dt.replace_time_zone(time_zone=None)

Daylight Saving Time (DST)

from datetime import datetime

# Build an hourly, time-zone-aware datetime range spanning the DST transition,
# then convert the Series into a DataFrame
df = pl.datetime_range(
    date(2020, 10, 25),
    datetime(2020, 10, 25, 4),
    "1h",
    time_zone="Europe/London",
    eager=True,
).to_frame('date')
df = df.with_columns(
  # Determine the DST offset
  dst_offset=pl.col('date').dt.dst_offset(),
  # Add 1d to date ignoring DST
  day_plus_1d=pl.col('date').dt.offset_by('1d'),
  # Add 24h (i.e. 1d) to date considering DST
  day_plus_24h=pl.col('date').dt.offset_by('24h'),
  # Handle ambiguities due to DST explicitly
  replaced_time_zone=pl.col('date').dt.replace_time_zone(
    'Europe/London',
    ambiguous=pl.Series(['earliest', 'earliest', 'latest', 'latest', 'latest', 'latest']),
  )
)

Grouping data over time

df = pl.scan_csv("../assets.csv", try_parse_dates=True)
df.group_by_dynamic(
  'date',
  every='1mo',
  group_by='symbol',
).agg(pl.mean('price')).collect()

Rolling computations

# Using rolling_mean_by()
df = pl.scan_csv("../assets.csv", try_parse_dates=True)
(
  df.with_columns(pl.col('price').rolling_mean_by('date', window_size='5d'))
  .collect()
)
# Using rolling()
(
  df
  .filter(pl.col('symbol') == 'ABBV')
  .rolling('date', period='5d')
  .agg(pl.col('price').mean().alias('rolling_price'))
  .collect()
)
# Rolling mean for each group using over()
(
  df.with_columns(pl.col('price').rolling_mean_by('date', window_size='5d').over('symbol'))
  .collect()
)
# Exponentially weighted averages
(
  df.with_columns(pl.col('price').ewm_mean_by('date', half_life='10d'))
  .collect()
)

Upsampling

df = pl.DataFrame(
    {
        'ts': pl.date_range(date(2025, 5, 1), date(2025, 5, 10), interval='4d', eager=True),
        'value': [4.0, 1.5, 7.0]
    }
)
# Without filling in missing values
df.upsample('ts', every='1d')
# Filling in missing values
df.upsample('ts', every='1d').interpolate()
