An intro to Polars
This introduction to Polars is my attempt to make it easy for future me to recollect what I have learnt during Polars training sessions provided by Quansight.
- Set up a virtual environment
- Create a DataFrame
- Working with DataFrame columns
- Working with DataFrame rows
- Non-primitive data type columns
- Aggregations
- Handling missing/invalid values
- Working with multiple DataFrames
- Categorical data
- Using LazyFrames instead of (eager) DataFrames
- Executing row-wise operations
- Streaming
- Miscellaneous
Right, let’s get to it. First set up a virtual environment. Then go through the examples below.
Set up a virtual environment
uv venv --seed;
source .venv/bin/activate;
python -m pip install uv;
python -m uv pip install jupyter pandas polars pyarrow;
Create a DataFrame
Basic
import polars as pl
pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
Specify column types
pl.DataFrame(
    {'key': ['A', 'B'], 'value': [1, 2]},
    schema_overrides={
        'key': pl.String,
        'value': pl.Int16,
    },
)
Reading from a file eagerly
pl.read_parquet('dataset.parquet')
Reading from a file lazily
df = pl.scan_parquet('dataset.parquet')
df.collect()
Working with DataFrame columns
Select columns
import polars.selectors as cs
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.select('key')
df.select(cs.all() - cs.numeric())
Add/replace columns
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_columns(
    keyvalue=pl.col('key') + pl.col('value').cast(pl.String)
)
Drop columns
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.drop('value')
Working with DataFrame rows
Select rows
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.filter(pl.col('value') > 1)
Select every Nth row
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df[::2] # Gather every 2nd row
df.gather_every(2) # Gather every 2nd row
Adding a row index column (similar to the index of a pandas DataFrame)
df = pl.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df.with_row_index()
Non-primitive data type columns
Lists
Creating lists from values
pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})
Creating lists using values from other columns
df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9]})
df.with_columns(x_y=pl.concat_list('x', 'y'))
Processing list values
df = pl.DataFrame({'x': [4, 1, 7], 'y': [8, 2, 9], 'z': [[1, 2], [6, 2], [-2, 9]]})
df.with_columns(z_mean=pl.col('z').list.mean())
Structs
Creating DataFrames with struct columns
df = pl.DataFrame({'cars': [{'make': 'Audi', 'year': 2020}, {'make': 'Volkswagen', 'year': 2024}]})
Unnesting a struct column
df.unnest('cars')
Selecting a field of a struct column
df.select(pl.col('cars').struct.field('make'))
Viewing the schema of a DataFrame containing struct columns
df.schema
Arrays
pl.Array is used to represent a fixed-size collection of values, whereas pl.List represents a variable-size collection of values. A contrasting pl.List example follows the pl.Array example below.
pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary'],
            ['John', 'Jane'],
        ],
    },
    schema={
        'friends': pl.Array(pl.String, 2),
    },
)
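For contrast, a sketch of the same column as a variable-size pl.List, where rows may differ in length:
pl.DataFrame(
    {
        'friends': [
            ['Mark', 'Mary', 'Mike'],
            ['John'],
        ],
    },
    schema={
        'friends': pl.List(pl.String),
    },
)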
Aggregations
Mean of the values in a column
df = pl.scan_parquet("../titanic.parquet")
df.select('survived').mean().collect()
Mean of the values in a column grouped by values in another column
df = pl.scan_parquet("../titanic.parquet")
df.group_by('class').agg(pl.col('survived').mean()).collect()
Mean of the values in a column grouped by values in another column and broadcast back onto the rows of the initial DataFrame (a window function, via .over())
df = pl.scan_parquet("../titanic.parquet")
df.select(
    'class',
    'survived',
    class_mean_survival=pl.col('survived').mean().over('class'),
).collect()
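For reference, .over() behaves like this more explicit group_by/agg followed by a join (a sketch of the equivalent logic, not necessarily how Polars executes it internally):
df = pl.scan_parquet("../titanic.parquet")
class_means = df.group_by('class').agg(
    class_mean_survival=pl.col('survived').mean()
)
df.select('class', 'survived').join(class_means, on='class').collect()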
Handling missing/invalid values
Null vs NaN in Polars
In Polars there are two distinct concepts:
- null: missing data (of any data type).
- NaN: a floating-point value, which results from e.g. 0/0.
The sketch below shows how the two behave differently.
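A minimal sketch of the difference (the column name x is made up):
df = pl.DataFrame({'x': [1.0, None, float('nan')]})
df.with_columns(
    x_is_null=pl.col('x').is_null(),  # True only for the missing (None) value
    x_is_nan=pl.col('x').is_nan(),    # True only for the NaN value; null propagates as null
)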
Counting values when some are missing/invalid
df = pl.scan_parquet("../titanic.parquet")
df.group_by('deck').len().collect()
Dropping missing/invalid values
df = pl.scan_parquet("../titanic.parquet")
df.drop_nulls().collect()
df.filter(pl.col('deck').is_not_null()).collect()
Working with multiple DataFrames
Joining DataFrames
df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
df1.join(df2, on='x')
Concatenating DataFrames (vertically)
df1 = pl.DataFrame({'x': [0.2, 1.3, 9.1], 'y': [-9.2, 88.2, 1.5]})
df2 = pl.DataFrame({'x': [9.1, 0.2], 'z': [13124.0, 559.3]})
pl.concat([df1, df2], how='diagonal')  # 'diagonal' unions the columns and fills missing values with nulls
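For reference, when the schemas match, the default how='vertical' suffices; a minimal sketch:
df1 = pl.DataFrame({'x': [0.2, 1.3]})
df2 = pl.DataFrame({'x': [9.1]})
pl.concat([df1, df2])  # how='vertical' is the default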
Categorical data
# Use a StringCache for the code block below in order to map strings to the same uints when
# creating df1, df2 and pl.concat([df1, df2])
#
# Alternatively, use pl.enable_string_cache() to enable the global string cache if you do not
# have a large number of strings.
with pl.StringCache():
    df1 = pl.DataFrame(
        {"a": ["blue", "blue", "green"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df2 = pl.DataFrame(
        {"a": ["green", "green", "blue"], "b": [4, 5, 6]},
        schema_overrides={"a": pl.Categorical},
    )
    df1.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    df2.with_columns(pl.col('a').to_physical().name.suffix('_physical'))
    pl.concat([df1, df2])
Restricting values to Enum values
s = pl.Series(["flower", "tree", "flower"], dtype=pl.Enum(["flower", "tree", "bonsai"]))
s.dtype
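Values outside the Enum categories are rejected at construction time; a small sketch (the exact exception type may differ between Polars versions):
try:
    pl.Series(["flower", "cactus"], dtype=pl.Enum(["flower", "tree", "bonsai"]))
except Exception as exc:  # 'cactus' is not one of the Enum categories
    print(exc)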
Using LazyFrames instead of (eager) DataFrames
- Using LazyFrames enables lazy evaluation, which lets Polars apply query optimizations automatically instead of requiring hand-rolled optimizations (see the sketch after this list).
- If you use expressions, lazy mode should be equivalent to eager mode (apart from the need to call .collect() at the end).
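As a sketch of the first point: a filter written after a select is still pushed down by the optimizer. This reuses dataset.parquet and the key/value columns from the earlier examples.
lf = pl.scan_parquet('dataset.parquet')
query = lf.select('key', 'value').filter(pl.col('value') > 1)
# The optimized plan should show the predicate applied at the scan
# rather than as a separate step after the projection.
print(query.explain())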
Textual representation of query plan
import polars as pl
df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.explain())
Digraph representation of query plan
df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.show_graph()
Executing row-wise operations
To execute some row-wise operations in Polars you might need to use nested data types.
df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select(pl.cum_sum_horizontal(pl.all())).unnest('cum_sum')
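Some horizontal operations have dedicated helpers that avoid the nested type and the unnest step; a sketch using pl.sum_horizontal and pl.max_horizontal:
df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.with_columns(
    row_sum=pl.sum_horizontal('a', 'b'),
    row_max=pl.max_horizontal('a', 'b'),
)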
Streaming
df = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.with_columns(c=pl.col('a') + pl.col('b')).collect(engine='streaming')
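Streaming pairs naturally with the sink_* methods, which write results to disk without materializing the full output in memory; a minimal sketch (out.parquet is a hypothetical output path):
lf = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
lf.with_columns(c=pl.col('a') + pl.col('b')).sink_parquet('out.parquet')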
Miscellaneous
- Data is stored in a columnar (Arrow) format when using Polars.
- In Polars objects are usually immutable.
- To improve execution time, prefer a non-eval approach first, eval second, map_batches third, and map_elements fourth (the further down the list, the more jumping between Python and the Rust binary).
- Use the struct column type if the format/structure of a column is fixed; otherwise use the object type.
- pl.Series()._get_buffers() -> underlying representation.
- Use .collect(engine='streaming') or .collect(engine='gpu') to use the streaming engine and/or GPUs when collecting results from a LazyFrame (older Polars versions used the [new_]streaming=True keyword instead).
- Using sorted data enables Polars to use some optimizations which reduce execution time. Use set_sorted to tell Polars that data is sorted (Polars won't check, so use with care); a small sketch follows below.
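A minimal set_sorted sketch (the column name a is made up):
df = pl.DataFrame({'a': [1, 2, 3]})
df.with_columns(pl.col('a').set_sorted())  # only sets the sorted flag; Polars does not verify the data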