A trick for pandas groupby!

How to GroupBy an increasing column such as a counter? [In pandas]

I often work with sensor measurements such as those from car CAN-bus. These measurements are often collected over multiple sessions, but rarely these sessions are indicated by a proper identifier. Sometimes timestamps (containing date and time) are available and one can infer based on the time interval between the recordings.
Today started work on a new dataset and this dataset instead of proper timestamp provides a column that is only a counter that starts counting at the beginning of recording, of course the measurements are supposed to be at 1Hz, therefore the counter is also a measure of time.
Details aside, there are times I encounter a similar dataset and I often like to separate these recording sessions or whatever they may represent in your dataset. For example to do a groupby1 and compute some statistics or any other complicated operations.
I am not overly familiar with all every feature in pandas or numpy so there may be an easier to tackle this problem, but here is how I usually solve the issue. Also one can always write a for loop but that is not an elegant solution.
I will generate some synthetic data and explain the process.

We do the usual imports and set the seed.

import numpy as np
import pandas as pd
np.random.seed(1)

To generate data first we sample 4 random integers between 3 and 7. These numbers are picked arbitrary to keep the example small, of course you can experiment with larger values. In this example simulate having 4 recording sessions each between 3 to 7 rows of data (or measurements).

counter_max = np.random.randint(3,7,4)
array([4, 6, 3, 3])

For each session we create counter values and two randomly sampled dummy variables. Then we create a dataframe from them.

inner_id = np.concatenate([np.array(range(x)) for x in counter_max])
var_a = np.concatenate([np.random.rand(x) for x in counter_max])
var_b = np.concatenate([np.random.rand(x) for x in counter_max])

df = pd.DataFrame({'inner_id':inner_id,
              'var_a': var_a,
              'var_b': var_b}
)

We use the method diff()2 to detect transitions between each session. diff when used with no arguments shifts the column values by one row and subtracts them from original values.

df['diff'] = df.inner_id.diff()

Here is the result:

df.head(7)
indexinner_idvar_avar_bdiff
000.0001140.140387NaN
110.3023330.1981011.0
220.1467560.8007451.0
330.0923390.9682621.0
400.1862600.313424-3.0
510.3455610.6923231.0
620.3967670.8763891.0

Next we need to extract the transition or breakpoints. Transitions occur only when the result of diff is negative. We also prepend 0 and append last index+1 (this is needed because the way pandas.cut() works).

break_points = np.concatenate(
    [np.array([0]),
     np.array(df[df.inner_id.diff() < 0].index),
     np.array([df.index[-1]+1])
    ], axis=0)

We construct a pd.IntervalIndex object from our breakpoints.

interval_idx = pd.IntervalIndex.from_breaks(break_points, closed='left')

Method cut3 is generally use for binning or categorizing numerical ranges. Here we use it to categorize index values of our dataframe and then immediately we throw away categories and only keep the codes or numerical value of the categories.

df['id'] = pd.cut(df.index, bins=interval_idx,include_lowest=True).codes

Here is the resulting dataframe. Now we can easily do df.groupby('id')

df
indexinner_idvar_avar_bdiffid
000.0001140.140387NaN0
110.3023330.1981011.00
220.1467560.8007451.00
330.0923390.9682621.00
400.1862600.313424-3.01
510.3455610.6923231.01
620.3967670.8763891.01
730.5388170.8946071.01
840.4191950.0850441.01
950.6852200.0390551.01
1000.2044520.169830-5.02
1110.8781170.8781431.02
1220.0273880.0983471.02
1300.6704680.421108-2.03
1410.4173050.9578901.03
1520.5586900.5331651.03

  1. pandas.DataFrame.groupby() ↩︎

  2. pandas.diff() ↩︎

  3. pandas.cut() ↩︎

Independent Researcher, CTO

My research interests include time series classification and privacy-preserving machine learning.