Before discussing how to obtain byte count estimates in flow slices, we show why a simpler solution does not work. We could have the byte counter in the flow entry just count the total number of bytes in the packets seen once the flow record is created. Just like with the packet counter, we would need an additive correction to account for the packets missed before the creation of the entry. We can get an unbiased estimate for the number of packets missed, but not for their total size, because we do not know their sizes. We could assume that the packet sizes are uniform within the flow, but this would lead to systematic biases because they are not. As the proof of the lemma below shows, storing the size of the sampled packet that led to the creation of the entry would solve the problem, because using it to estimate the total number of bytes in the packets not counted does lead to an unbiased estimator. But this would require another field in the flow record. Instead, we store this information in the byte counter itself by initializing $c_b$ to $b_{first}/p$ when the entry is created ($b_{first}$ is the size in bytes of the sampled packet and $p$ is the sampling probability). Let $b$ be the number of bytes of the flow at the input of the flow slicing algorithm.
Lemma: $\hat{b} = c_b$ is an unbiased estimator of $b$.
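The counter update just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the text: a single flow observed within one slice, an entry that never expires once created, and hypothetical names (`byte_counter`, `packet_sizes`, `rng`) chosen for the example.

```python
import random


def byte_counter(packet_sizes, p, rng=random):
    """Simulate the byte counter of one flow entry (illustrative sketch).

    packet_sizes: sizes in bytes, in arrival order.
    p: probability that a packet creates the entry while none exists.
    Returns the final counter value, or 0.0 if no entry was ever created.
    """
    c_b = None  # no flow entry yet
    for size in packet_sizes:
        if c_b is None:
            # Without an entry, the packet is seen only if it is sampled.
            if rng.random() < p:
                c_b = size / p  # initialize to (size of sampled packet) / p
        else:
            c_b += size  # entry exists: every later packet is counted exactly
    return 0.0 if c_b is None else c_b


if __name__ == "__main__":
    sizes = [40, 1500, 576]
    runs = 100_000
    mean = sum(byte_counter(sizes, 0.25) for _ in range(runs)) / runs
    print("true bytes:", sum(sizes), "mean counter over", runs, "runs:", mean)
```

Averaged over many runs, the counter should come out close to the true byte count, even though any single run can be far from it.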
Proof: By induction on the number of packets in the flow, $s$. Let $b_i$ for $i$ from $1$ to $s$ be the sizes of the individual packets. By definition the number of bytes in the flow is $b = \sum_{i=1}^{s} b_i$. For convenience of notation, we index the packet sizes in reverse order, so $b_1$ will be the size of the last packet and $b_s$ the size of the first one.
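Base case If $s=1$, the only packet is sampled with probability $p$ and in that case it is counted as $b_1/p$ bytes. With probability $1-p$, it is not sampled (and it counts as 0). Thus the expected value of the byte counter is $E[c_b] = p \cdot b_1/p + (1-p) \cdot 0 = b_1 = b$.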
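Inductive step By the induction hypothesis, we know that if the first packet is not sampled we are left with the last $s-1$ packets, and the byte counter satisfies $E[c_b] = \sum_{i=1}^{s-1} b_i$. If the first packet gets sampled, which happens with probability $p$, we count it as $b_s/p$ and we count the rest exactly, because the flow slice length and the inactivity timeout are larger than the bin size. Combining the two cases, $E[c_b] = p \left( b_s/p + \sum_{i=1}^{s-1} b_i \right) + (1-p) \sum_{i=1}^{s-1} b_i = \sum_{i=1}^{s} b_i = b$.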
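The induction can also be checked numerically. The sketch below computes the exact expected value of the byte counter with a recursion that mirrors the two cases of the inductive step, using exact rational arithmetic. The function name `expected_byte_counter` and the example packet sizes are assumptions made for illustration, and the model again assumes a single flow within one slice.

```python
from fractions import Fraction


def expected_byte_counter(sizes, p):
    """Exact expected counter value for packet sizes given in arrival order."""
    p = Fraction(p)
    if not sizes:
        return Fraction(0)  # no packets left: nothing can be counted
    first, rest = sizes[0], sizes[1:]
    # Case 1 (probability p): the first packet is sampled, counted as
    # first / p, and every later packet is counted exactly.
    if_sampled = Fraction(first) / p + sum(rest)
    # Case 2 (probability 1 - p): the first packet is missed and the
    # same argument applies to the remaining packets.
    return p * if_sampled + (1 - p) * expected_byte_counter(rest, p)


if __name__ == "__main__":
    sizes = [40, 1500, 576]
    for p in (Fraction(1, 100), Fraction(1, 4), Fraction(1)):
        print(p, expected_byte_counter(sizes, p), sum(sizes))
```

For any sampling probability in (0, 1], the expectation equals the sum of the packet sizes, regardless of how uneven the sizes are.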
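If we sample packets randomly with probability $q$ before applying the flow slicing algorithm, we will want to estimate the number of bytes $B$ at the input of the packet sampling stage. Since the number of bytes $b$ reaching the flow slicing algorithm satisfies $E[b] = qB$, it is easy to show that $\hat{B} = c_b/q$, where $c_b$ is the byte counter of the flow entry, is an unbiased estimator for $B$.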
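The scaling for the packet sampling stage can be illustrated the same way. The sketch below enumerates every outcome of independent packet sampling and computes the exact expected number of bytes that reach the flow slicing algorithm. It writes q for the packet sampling probability and takes the unbiasedness of the byte counter for granted, so the expected counter value for each outcome is simply the byte total of the surviving packets. The function name `expected_bytes_after_sampling` and the example sizes are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_bytes_after_sampling(packet_sizes, q):
    """Exact expected byte total that survives packet sampling with rate q."""
    q = Fraction(q)
    expectation = Fraction(0)
    # Enumerate every keep/drop pattern over the packets.
    for kept in product([False, True], repeat=len(packet_sizes)):
        prob = Fraction(1)
        surviving_bytes = 0
        for size, keep in zip(packet_sizes, kept):
            prob *= q if keep else 1 - q
            if keep:
                surviving_bytes += size
        expectation += prob * surviving_bytes
    return expectation


if __name__ == "__main__":
    sizes = [40, 1500, 576]
    q = Fraction(1, 10)
    e_b = expected_bytes_after_sampling(sizes, q)
    print("E[b] =", e_b, " E[b] / q =", e_b / q, " true bytes =", sum(sizes))
```

Dividing the expectation by q recovers the byte total before packet sampling, which is why scaling the counter by 1/q keeps the estimate unbiased.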