P4 Programmable Traffic Management Stephen Ibanez & Gordon Brebner 9/12/2018
Agenda

- Goals
- Motivating Discussions
- Traffic Manager Architectural Model Overview
- Putative P4 Extensions
- Moving Forward
- Discussions
What is Traffic Management?

- Packet Scheduling – in what order?
- Traffic Shaping – at what rate / pacing?
- Policing – is the packet compliant?
- Drop Policy – how to avoid / deal with congestion?
- Packet Buffering – how to store packets?
- Packet Replication – how to replicate packets?
- Classification – how to map packets into queues?
Where are traffic managers used?

- Integrated switch (PISA): ingress parser -> ingress M/A -> ingress deparser -> traffic manager -> egress parser -> egress M/A -> egress deparser
- NIC or line card: the traffic manager sits between the Ethernet-port pipeline and the host (PCI Express, for a NIC) or the switch fabric (for a line card)
Why should we care about Traffic Management?

Lots of different types of traffic with different characteristics and requirements:
- Characteristics: burstiness, packet sizes, flow sizes, flow rates
- Requirements: throughput, latency, loss, jitter, reordering, flow completion time, tail latency

Network operators have a wide range of objectives: meet all SLAs, maximize network utilization, achieve fairness.

Network devices are picking up more functionality: more network programmability, more types of traffic, more TM requirements, more complicated ASICs. Embrace programmability!

About 50% of a modern programmable switch chip is dedicated to buffering and traffic management logic – but it is not programmable!
Why should we care about Traffic Management?

- WAN links are expensive: we want to make the best use of them by prioritizing traffic
- Modern congestion control algorithms rely on accurate packet pacing (e.g. BBR [1], TIMELY [2])
- Performance isolation for thousands of VMs (millions of flows) per server (e.g. Carousel [3])

[1] Cardwell, Neal, et al. "BBR: Congestion-Based Congestion Control." ACM Queue 14.5 (2016).
[2] Mittal, Radhika, et al. "TIMELY: RTT-based Congestion Control for the Datacenter." ACM SIGCOMM 2015.
[3] Saeed, Ahmed, et al. "Carousel: Scalable Traffic Shaping at End Hosts." ACM SIGCOMM 2017.
Benefits of Programmable Traffic Management

Benefits of an unambiguous way to define TM algorithms:
- Portability
- Formal verification and static analysis
- A precise way to express customer needs

Benefits of having programmable TM:
- Innovation
- Differentiation
- Reliability
- Network operators can fine-tune for performance
- Only a small menu of algorithms to choose from today, but many possible algorithms that could be expressed
Existing TM algorithms are too limiting

TM algorithm parameters can be derived from:
- Network operator static configurations
- Per-packet information
- Per-flow or per-class information (i.e. state stored across packets)
- Current queue state
- Aggregated queue state (e.g. average sizes, average service rates)
- Other state that evolves over time
- Randomness
- Combinations of any of the above parameters:
  - combined arithmetically
  - combined temporally (i.e. alternating between various policies)
  - combined hierarchically
Generic Traffic Manager Architecture

(figure: classification & policing & drop policy (non-P4-programmable) -> packet buffer -> scheduling & shaping (non-P4-programmable) -> crossbar / PRE)
- Output queued
- Each egress port may have many associated queues
- A NIC may not need a crossbar or PRE
Proposed TM Architecture Model

(figure: input packet -> ingress logic (classification & policing & drop policy) -> packet buffer (queues 1..N, plus state) -> scheduling / shaping -> output packet; an extern interface exposes buffer state to the ingress logic)
Ingress Logic

Match/action processing:
- Classification: decide which queue (e.g. per-flow queues)
- Policing: is the packet compliant? (e.g. use PSA meters to drop packets)
- Drop policy: how to drop packets during periods of perceived congestion (e.g. WRED)
- Extract scheduling metadata
- Extern interface to access packet buffer state
Packet Buffer

- Memory space to store packets
- Divided into queues to isolate traffic; queue boundaries may be flexible
- Packet order within a queue = scheduling order
- Exchanges packet descriptors and metadata with the scheduling / shaping logic
- State: size of each queue, accessible to the ingress logic (read only)
The Push-In-First-Out (PIFO) Model [1]

What is a PIFO?
- Programmable rank computation in front of a fixed PIFO: a sorted queue that entries are pushed into by rank and always dequeued from the head

Why is the PIFO a good model?
- The scheduling decision is made at enqueue time, which helps relax timing constraints and leads to an efficient implementation
- Clear separation between fixed and programmable logic
- Can implement existing algorithms: Start-Time Fair Queueing (STFQ), Least Slack-Time First (LSTF), Stop-and-Go Queueing, minimum rate guarantees, fine-grained priority scheduling, Service-Curve Earliest Deadline First (SC-EDF), Rate-Controlled Service Disciplines (RCSD), token bucket rate limiting
- Can implement new algorithms
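To make the PIFO semantics concrete, here is a minimal behavioral sketch in Python (illustrative only, not part of the proposal): enq inserts an entry at the position determined by its rank, deq always removes from the head, and equal ranks are served in FIFO order.

import bisect
from itertools import count

class Pifo:
    """Behavioral model of a Push-In-First-Out queue."""
    def __init__(self, size):
        self.size = size
        self.entries = []        # kept sorted by (rank, arrival seq)
        self._seq = count()      # FIFO tie-break among equal ranks

    def enq(self, rank, meta=None):
        if len(self.entries) >= self.size:
            raise OverflowError("PIFO full")
        bisect.insort(self.entries, (rank, next(self._seq), meta))

    def deq(self):
        rank, _, meta = self.entries.pop(0)   # head = smallest rank
        return rank, meta

# Pushing ranks 8, 7, 4, 3, 0 (as in the slide's figure) dequeues
# them in sorted order:
p = Pifo(2048)
for r in [8, 7, 4, 3, 0]:
    p.enq(r)
print([p.deq()[0] for _ in range(5)])   # [0, 3, 4, 7, 8]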
Scheduling / Shaping: PIFO Tree

- Each node: one PIFO, with programmable enq and deq logic and potentially shared state between them
- Scheduling nodes implement scheduling algorithms; leaf nodes store packet descriptors, parent nodes store references to scheduling nodes
- Shaping nodes implement shaping algorithms; a shaping node shapes all nodes within its subtree
- Path computation decides which leaf node to enqueue into
(figure: a PIFO tree of scheduling and shaping nodes between the classification stage and the packet buffer, exchanging descriptors and metadata)
Scheduling & Shaping Demonstration
Scheduling and Shaping

Goal: enqueue 3 packets, then dequeue 3 packets.
(figure: tree with a scheduling root, a shaping node in the path, and scheduling nodes for Class A and Class B)
Walkthrough (figures): Enqueue Packet 1, Enqueue Packet 2, Enqueue Packet 3, Enqueue Packet 2 (continued), Dequeue Packet 3, Dequeue Packet 1, Enqueue Packet 3 (continued), Dequeue Packet 2. Each step animates a packet descriptor moving through the classification stage, the PIFO tree, and the packet buffer; the "(continued)" steps show the shaping node releasing a reference into its parent once the packet's send time arrives.
Prog. Drop Policy

- Extern interface to read buffer state
- Can implement WRED-like policies
(figure: ingress logic consults queue state and drops packet p4)
New Externs

/* PIFO extern */
extern pifo<T, M> {
    pifo(bit<32> size);                   // constructor
    void enq(in T rank, in M meta);       // meta is optional
    void deq(out T rank, out M meta);     // meta is optional
}
// example instantiation
pifo<rank_t, sched_meta_t>(2048) p;

/* Packet buffer extern */
extern buffer<T> {
    buffer(bit<32> num_queues);
    // get the current size of a queue
    void get_q_size(in T q_id, out bit<32> size);
}
// example instantiation
buffer<qid_t>(1024) buf;
Simple Architecture

User-defined scheduling metadata:

parser Parser<H, M>(packet_in b,
                    out H hdr,
                    out M user_meta,
                    inout std_meta_t std_meta);

control Ingress<H, M, D>(inout H hdr,
                         out D sched_meta,
                         inout M user_meta,
                         inout std_meta_t std_meta);

scheduler MyScheduler<D>(in D sched_meta);

control Egress<H, M>(inout H hdr,
                     inout M user_meta,
                     inout std_meta_t std_meta);

control Deparser<H, M>(packet_out b,
                       in H hdr,
                       in M user_meta,
                       inout std_meta_t std_meta);

(figure: Ingress Parser -> Ingress M/A -> scheduling / shaping + non-P4-programmable PRE / buffer -> Egress M/A -> Egress Deparser)
Scheduler Block

scheduler MyScheduler(in sched_meta_t sched_meta) {
    /* Define PIFO tree nodes */

    /* root node */
    node strict_priority {
        type scheduling;
        pifo<rank_t>(2048) p;
        enqueue { }
        dequeue { }
    }

    /* shaping node */
    node token_bucket {
        type shaping;
        pifo<rank_t, sched_meta_t>(2048) p;
        enqueue { }
        dequeue { }
    }

    /* Define the shape of the tree */
    tree myTree {
        strict_priority(), {wfq(), {token_bucket(), {wfq()}}}
    }

    table find_path { }

    apply {
        find_path.apply();
        // apply the scheduling algorithm defined by the tree
        myTree.apply(leaf_node);
    }
}

(tree: strict priority root over a WFQ node and a token bucket over a second WFQ node)
Example – Strict Priority

/* root node */
node strict {
    type scheduling;
    // PIFO extern instantiation format:
    // pifo<rank_type>(size_in_pkts) instance_name;
    pifo<rank_t>(2048) p;
    enqueue {
        rank_t rank;
        rank = sched_meta.diffserv;   // maybe use a table?
        p.enq(rank);
    }
    dequeue {
        rank_t rank;
        p.deq(rank);
    }
}

(tree: strict root; Class A low priority, Class B high priority)
Example – Strict Priority with Starvation Prevention

- FCFS nodes prevent reordering of packets within a flow
- Also useful for traffic that is sensitive to jitter (e.g. voice)
(tree: strict root with high and low branches; FCFS leaves for Flow A and Flow B, with a token bucket shaping one branch)
Example – Strict Priority with Starvation Prevention

node token_bucket {
    type shaping;
    pifo<rank_t, sched_meta_t>(2048) p;
    enqueue {
        // declare registers: tb, last_time (lt), rate (r), max_tokens (B)
        @atomic {
            // atomically update last_time and tb registers
            tb.read(0, old_tokens);
            old_tokens = old_tokens + r * (now - lt);
            if (old_tokens > B) {
                old_tokens = B;
            }
            // compute new tokens and send time
            if (sm.pkt_len <= old_tokens) {
                new_tokens = old_tokens - sm.pkt_len;
                send_time = now;
            } else {
                new_tokens = 0;
                send_time = now + (sm.pkt_len - old_tokens) / r;
            }
            tb.write(0, new_tokens);
        }
        p.enq(send_time, sm);
    }
    dequeue {
        rank_t rank;
        p.deq(rank, sm);
    }
}
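As a sanity check on the send-time arithmetic above, a small Python model (illustrative only; the rate and burst values are made up):

def token_bucket_rank(tokens, last_time, now, pkt_len, r, B):
    """Compute (new_tokens, send_time) as in the shaping node above."""
    tokens = min(tokens + r * (now - last_time), B)  # refill, cap at burst B
    if pkt_len <= tokens:
        return tokens - pkt_len, now          # enough tokens: send now
    # not enough tokens: wait until the deficit is refilled at rate r
    return 0, now + (pkt_len - tokens) / r

# e.g. r = 1 byte per time unit, burst B = 1500 bytes, empty bucket:
print(token_bucket_rank(0, 0, 100, 1500, 1.0, 1500))   # (0, 1500.0): waits
print(token_bucket_rank(0, 0, 2000, 1500, 1.0, 1500))  # (0.0, 2000): sends now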
Example – Strict Priority with Starvation Prevention

node FCFS {
    type scheduling;
    pifo<rank_t>(2048) p;
    enqueue {
        rank_t rank = get_time();   // extern function
        p.enq(rank);
    }
    dequeue {
        rank_t rank;
        p.deq(rank);
    }
}

/* Define the shape of the tree */
tree myTree {
    strict_priority(), {FCFS(), {token_bucket(), {FCFS()}}}
}

apply {
    // path computation logic
    pifo_id_t leaf_node;
    if (sched_meta.flow_id == 0) {
        leaf_node = RIGHT;
    } else {
        leaf_node = LEFT;
    }
    // apply the scheduling / shaping algorithm
    myTree.apply(leaf_node);
}
Example – Low Latency Traffic

WFQ across all queues, until a particular queue exceeds a configured threshold; at that point the queue is given strict priority until it has been sufficiently drained.

Classification sets the desired queue size in sched_meta.
(tree: low-latency root over FCFS leaves for Flows A, B, and C)
Example – Low Latency Traffic

Ingress Logic:
buffer<qid_t>(NUM_QUEUES) buf;   // instantiate buffer extern
// get size of low latency queue
sched_meta.ll_q_size = buf.get_q_size(LL_Q_ID);
lookup_queue.apply();            // set egress queue

Root Node:
node low_latency {
    type scheduling;
    pifo<rank_t>(2048) p;
    enqueue {
        if (sched_meta.ll_q_size > THRESH) {
            // Strict priority logic
        } else {
            // WFQ logic
        }
        p.enq(rank);
    }
    dequeue {
        rank_t rank;
        p.deq(rank);
    }
}
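One plausible rank computation for this root node, sketched in Python; the two-level rank encoding and the WFQ fallback rank are assumptions, not part of the slide:

LL_PRIO, NORMAL_PRIO = 0, 1   # assumption: lower rank dequeues first

def low_latency_rank(ll_q_size, thresh, is_ll_pkt, wfq_rank):
    """Give the low-latency queue strict priority once it backs up;
    otherwise fall back to a WFQ-style rank for all traffic."""
    if ll_q_size > thresh and is_ll_pkt:
        return (LL_PRIO, 0)            # strict: jumps ahead of all WFQ ranks
    return (NORMAL_PRIO, wfq_rank)     # normal: ordered by WFQ rank

print(low_latency_rank(10, 8, True, 42))   # (0, 0)  -> drained first
print(low_latency_rank(10, 8, False, 42))  # (1, 42) -> WFQ order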
Runtime Control

- The tree can be updated by the control plane, but not from within the P4 program (akin to adding table entries)
- Will need P4Runtime support for tree configuration
- Tables and externs can be handled normally
Open Source Implementations

- NetFPGA prototype (P4 Workshop demo): PIFO scheduler with programmable rank computation in front of the packet buffer and per-port queues
- PIFO paper ASIC design [1]: PIFO block built from a flow scheduler (flip-flops) and a rank store (SRAM)
(figures: prototype datapath and PIFO block diagram)
Challenges with TM Programmability

- Scaling programmable designs to high rates and large sizes:
  - Bottleneck #1 – timing closure for the scheduling decision logic
  - Bottleneck #2 – memory bandwidth / interface to the large packet buffer
- Abstractions to program the PIFO tree
- Limitations of the proposed model
Moving Forward

- Motivation from the community?
- Settle on an architecture model
- Define language semantics: is a tree too low-level? Are there better abstractions?
- Improve open source HW implementations
- Open source simulation/emulation environment: bmv2? NS3? Other?
- Small P4 TM working group
References

[1] Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, Nick McKeown. "Programmable Packet Scheduling at Line Rate." Proceedings of the 2016 ACM SIGCOMM Conference. https://cs.nyu.edu/~anirudh/pifo-sigcomm.pdf
Questions?
NetFPGA Prototype (P4 Workshop Demo)

- PIFO scheduler built from parallel deterministic skip lists and a register-based sorting cache
- BRAM-based packet buffer
(figure: top-level block diagram – rank computation feeds a load balancer over parallel {register cache + skip list} engines, with a selector on removal)
PIFO Paper ASIC Design [1]

- Flow scheduler: chooses among the head packets of each flow (flip-flops)
- Rank store: stores the computed ranks for each flow in FIFO order (SRAM)
- PIFO blocks connected in a full mesh
- 64-port 10 Gbps shared-memory switch at 1 GHz; 1000 flows, 64K packets
(figure: PIFO block diagram)
TM Model Limitations
Input vs Output Rate Limiting

Goal of rate limiting: control the rate at which bytes are removed from the buffer and scheduled.
- Output rate limiting – direct approach; short- and long-term rate guarantees
- Input rate limiting – indirect approach; long-term rate guarantees only
(figure: strict root over a rate-limited FCFS class and a high-priority FCFS class; once scheduled, packets are still sent out at line rate)
Arbitrary Reordering of Buffered Packets

pFabric scheduling order: SRPT, with FIFO order within flows.
(figure: buffered packets p0–p3 with their remaining-size ranks)
Approximate pFabric

SRPT across per-flow FCFS queues only approximates pFabric:
- Final scheduling order: p3 p2 p0 p1
- pFabric scheduling order: p0 p3 p2 p1
(figure: SRPT root over two FCFS leaves holding packets p0–p3)
Soft Rate Limiting

"Rate limit flow A to 2 Gbps only if the egress link is congested."

Hard vs. soft rate limiting:
- Hard – hold packets until it is time to send (non-work-conserving)
- Soft – packets can be sent early if bandwidth is available (work-conserving)
Soft Rate Limiting

(figure: PIFO tree with a "default" path that lets a shaped packet's reference be scheduled before its send time)
What if there are multiple shaping nodes?
Ingress Logic Limitations

Drop policy:
- No drop-head algorithms (unless using speedup)
- No algorithms that must programmatically update state in the packet buffer
Best Effort for Traffic Shaping

- Shaping operations cause conflicts when scheduling and shaping transactions commit at the same time
- Conflicts are resolved in favor of scheduling transactions
- Rationale: scheduling algorithms must run at line rate, while shaping algorithms can afford to be delayed a few cycles
Enqueue Conflict

(figure: a scheduling enqueue for packet p4 collides with a shaping node's delayed enqueue at the same node in the same time slot)
Constraints

- 1 enqueue and 1 dequeue at each level of the tree per time slot
- Need to deal with wrap-around of rank values
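One standard way to handle rank wrap-around (an assumption here, not from the slides) is serial-number arithmetic, RFC 1982 style: compare ranks by their modular difference, which works as long as the outstanding ranks span less than half the rank space. A Python sketch:

RANK_BITS = 16
HALF = 1 << (RANK_BITS - 1)
MASK = (1 << RANK_BITS) - 1

def rank_lt(a, b):
    """True if rank a precedes rank b under modular wrap-around."""
    d = (b - a) & MASK          # modular distance from a to b
    return 0 < d < HALF         # a precedes b iff b is "ahead" by < 2^15

# 0xFFFF wrapped just before 0x0003, so it still compares as earlier:
print(rank_lt(0xFFFF, 0x0003))  # True
print(rank_lt(0x0003, 0xFFFF))  # False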
Scheduling Demonstration
Scheduling

Goal: enqueue 3 packets, then dequeue 3 packets.
(figure: tree with a scheduling root over scheduling nodes for Class A and Class B)
Walkthrough (figures): Enqueue Packet 1, Enqueue Packet 2, Enqueue Packet 3, Dequeue Packet 3, Dequeue Packet 2, Dequeue Packet 1. Each enqueue step places a packet descriptor into a leaf PIFO and pushes a reference into the parent; each dequeue step pops the root and follows the reference down to the leaf.
Additional Examples
Example – Shortest Remaining Processing Time (SRPT)

Ingress Logic:
table lookup_queue {
    key = { sched_meta.rank : ternary; }
    actions = { set_queue; set_default_queue; }
    size = 1024;
    default_action = set_default_queue();
}
// apply block
sched_meta.rank = hdr.bytes_remaining;
lookup_queue.apply();

Scheduling Node:
node SRPT {
    type scheduling;
    pifo<rank_t>(2048) p;
    enqueue {
        p.enq(sched_meta.rank);
    }
    dequeue {
        rank_t rank;
        p.deq(rank);
    }
}

(tree: a single SRPT node over Flow A ... Flow X)
Example – Weighted Fair Queueing (WFQ)

node wfq {
    // PIFO extern instantiation
    pifo<rank_t>(2048) p;
    // virtual time state to track the last rank dequeued
    register<rank_t>(1) virtual_time;
    enqueue {
        // declare weight & last_finish register arrays
        // read virtual time
        virtual_time.read(0, vt);
        @atomic {
            last_finish.read(flow_id, lf);
            if (lf != 0) {
                // seen this flow before
                start = max(vt, lf);
            } else {
                start = vt;
            }
            weight.read(flow_id, fw);
            lf = start + sched_meta.pkt_len / fw;
            last_finish.write(flow_id, lf);
        }
        rank = start;
        p.enq(rank);
    }
    dequeue {
        bit<16> rank;
        p.deq(rank);
        virtual_time.write(0, rank);
    }
}

(tree: a single WFQ node over Class A, Class B, and Class C)
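A small Python model of the STFQ rank computation above, showing how a heavier weight yields smaller finish-time increments and therefore more frequent service (the weights and packet sizes are made-up values; dequeue, which would advance virtual_time, is not modeled):

virtual_time = 0
last_finish = {}                 # per-flow last finish time

def stfq_rank(flow_id, pkt_len, weight):
    """Start-Time Fair Queueing: rank = max(virtual_time, last_finish)."""
    start = max(virtual_time, last_finish.get(flow_id, 0))
    last_finish[flow_id] = start + pkt_len // weight
    return start

# Flow A (weight 2) vs Flow B (weight 1), 100-byte packets:
ranks = [(f, stfq_rank(f, 100, w)) for f, w in
         [("A", 2), ("B", 1), ("A", 2), ("B", 1), ("A", 2)]]
print(ranks)  # A's ranks grow half as fast, so A gets ~2x the bandwidth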
Example – Strict Priority w/ WFQ

/* Define the shape of the tree */
tree myTree {
    strict(), {WFQ(), WFQ()}
}

table find_path { }

apply {
    // path computation logic
    find_path.apply();
    // apply the scheduling / shaping algorithm
    myTree.apply(leaf_node);
}

(tree: strict root over two WFQ nodes, one low and one high priority, serving Classes A–D)
Example – Bandwidth Reservations (Min Rate Guarantees)

- The root computes the rank as either 0 (below min rate) or 1 (above min rate)
- FCFS nodes are used to prevent reordering within a flow
(tree: strict (min rate) root over FCFS leaves for Flows A, B, and C)
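A sketch of that two-level rank computation in Python, using a per-flow token bucket refilled at the guaranteed rate as the under/over min-rate test (the bucket mechanism and values are assumptions, not from the slide):

def min_rate_rank(tokens, pkt_len):
    """Rank 0 while the flow is under its guaranteed rate, else rank 1.

    'tokens' is the flow's token-bucket level, refilled at the min rate;
    a packet that fits in the bucket counts as below-min-rate traffic."""
    if pkt_len <= tokens:
        return 0, tokens - pkt_len   # below min rate: high priority
    return 1, tokens                 # above min rate: best effort

print(min_rate_rank(2000, 1500))  # (0, 500) -> guaranteed traffic
print(min_rate_rank(500, 1500))   # (1, 500) -> best-effort traffic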
More Examples

- Heavy hitter detection to map heavy-hitter flows to a separate queue and assign their packets lower priority
- A scheduling transaction that computes the level of burstiness for each flow and assigns priority as the inverse of the burstiness – encourages smooth traffic