
ABSTRACT

In this sub-project, we will explore the designs of high-speed switches and extend our results on the Gigabit Ethernet Service Switch (GESS) to the Gigabit Ethernet Multi-Service Switch (GEMSS).

1. Design and implementation of high-speed switches that scale with the speed of fiber optics.

One of the main objectives of this sub-project is to design and implement high-speed switches that scale with the speed of fiber optics. Our approach will be based on the load-balanced Birkhoff-von Neumann switch architecture developed in the first phase of our PPAEU project.

2. Design and implementation of high-speed multi-service switch.

The second main objective of this sub-project is to design and implement a policy-based high-speed multi-service switch, based on a network processor hardware platform, to provide gigabit-level multi-services, such as VPN, QoS mobility management, IPv4/IPv6 translation, intrusion prevention, content filtering, on-the-fly anti-virus, etc., over LAN/MAN environments. Our approach will be based on the network processor-based service switch platform developed in the first phase of our PPAEU project.

 

CONTENTS

In this sub-project, we will explore the designs of high-speed switches and extend our results on the Gigabit Ethernet Service Switch (GESS) to the Gigabit Ethernet Multi-Service Switch (GEMSS).

1. Design and implementation of high-speed switches that scale with the speed of fiber optics.

One of the main objectives of this sub-project is to design and implement high-speed switches that scale with the speed of fiber optics. Our approach will be based on the load-balanced Birkhoff-von Neumann switch architecture developed in the first phase of our PPAEU project.

Switch Architecture

Most of the switches in the market (e.g., Ethernet switches) are based on the so-called shared-memory switch architecture. In such a switch architecture, packets are stored and forwarded in a common shared memory. As the speed of fiber optics advances, the memory access speed becomes a bottleneck for the shared-memory switch architecture. With current memory technology, a shared-memory switch is limited to a speed of 50 Gbits/sec. To go beyond such a speed, one has to use a parallel-buffered switch architecture to acquire the needed speedup. One common approach, known as the input-buffered switch architecture, is to place parallel buffers in front of a (crossbar) switch fabric. In an input-buffered switch, time is synchronized and slotted so that packets of the same size can be transmitted in parallel. As there are multiple buffers (memories), the key problem of an input-buffered switch is to resolve the conflicts of parallel memory access.
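To make the memory bottleneck concrete, here is a rough back-of-envelope sketch in Python (our illustration; the 50 Gbits/sec figure is the one cited above, while the 64-byte packet size is an assumption):

    # Rough illustration (assumed 64-byte packets): a single shared memory
    # must perform one write and one read for every packet it forwards.
    aggregate_rate = 50e9                      # bits/sec, the limit cited above
    packet_bits = 64 * 8                       # 512 bits per packet (assumption)
    packets_per_sec = aggregate_rate / packet_bits
    access_time_ns = 1e9 / (2 * packets_per_sec)
    print(access_time_ns)                      # ~5.1 ns per memory access

An access time of roughly 5 ns per operation is already at the edge of what commodity memory can sustain, which is why higher aggregate rates require parallel buffers.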

Traditionally, conflict resolution in an input-buffered switch is solved by finding a matching between inputs and outputs per time slot (see e.g., [1.1], [1.6], [1.7], [1.8], [1.9], [1.10], [1.11], [1.12]). Two steps are needed for finding a matching:

- Communication overhead: one has to gather the information of the buffers at the inputs.

- Computation overhead: based on the gathered information, one then applies a certain algorithm to find a matching.

Most of the works in the literature pay more attention to reducing the computation overhead by finding scalable matching algorithms, e.g., wavefront arbitration in [1.12], PIM in [1.1], SLIP in [1.9], and DRRM in [1.8]. However, in our view, it is the communication overhead that makes matching per time slot difficult to scale. To see this, suppose that there are N inputs/outputs and each input implements N virtual output queues (VOQs). If we use a single bit to indicate whether a VOQ is empty, then we have to transmit N bits from each input (to a central arbiter or to an output) in every time slot. For instance, transmitting such N-bit information in PIM and SLIP is implemented by an independent circuit that sends out parallel requests. Suppose that the packet size is chosen to be 64 bytes, i.e., 512 bits. Then a switch with more than 512 inputs/outputs incurs more communication overhead than transmitting the data itself.
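The break-even point is easy to check (a sketch under the assumptions above: one occupancy bit per VOQ and 64-byte packets):

    # One request bit per VOQ per input, vs. one 64-byte (512-bit) cell per slot.
    cell_bits = 64 * 8
    for N in (64, 256, 512, 1024):
        overhead = N / cell_bits               # request bits relative to data bits
        print(f"N={N}: request info = {overhead:.0%} of the data")

At N = 512 the per-slot request information equals the packet itself, and beyond that point it dominates.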

Figure 1.1. The load-balanced Birkhoff-von Neumann switch with one-stage buffering.

It would be ideal if there were a switch architecture that yields good throughput without the need for gathering traffic information (no communication overhead) or computing connection patterns (no computation overhead). Our recent works on the load-balanced Birkhoff-von Neumann switches (see e.g., [1.2], [1.3], [1.4]) shed some light in this direction. The load-balanced Birkhoff-von Neumann switch consists of two crossbar switch fabrics (see Figure 1.1) with parallel buffers between them. In each time slot, both crossbar switch fabrics set up connection patterns corresponding to permutation matrices that are periodically generated from a one-cycle permutation matrix. By so doing, the first stage performs load balancing for the incoming traffic so that the traffic entering the second stage is uniform. As such, it suffices to use the same periodic connection patterns as in the first stage to perform switching at the second stage. In the load-balanced Birkhoff-von Neumann switch, there is no need to gather traffic information. Also, as the connection patterns are periodically generated, no computation is needed at all. More importantly, it can be shown to achieve 100% throughput for any non-uniform traffic under a minor technical assumption.

However, the main drawback of the load-balanced Birkhoff-von Neumann switch is that packets might be delivered out of sequence. Two approaches have been proposed to solve the out-of-sequence problem in two-stage switches. The first uses sophisticated scheduling in the buffers between the two switch fabrics (see e.g., [1.3], [1.13]) and hence may require complicated hardware implementation and non-scalable computation overhead. The second uses rate information to control the traffic entering the switch (see e.g., [1.4]). However, this requires communication overhead and does not adapt well to large traffic fluctuations.

In our recent work [1.5], we propose a very simple method to solve the out-of-sequence problem in the load-balanced Birkhoff-von Neumann switch without non-scalable computation or communication overhead. The switch, called the mailbox switch, has the same architecture as the load-balanced Birkhoff-von Neumann switch in Figure 1.1. Instead of using an arbitrary set of periodic connection patterns generated by a one-cycle permutation matrix, the key idea of the mailbox switch is to use a set of symmetric connection patterns. As an input and its corresponding output are usually built on the same line card, the symmetric connection patterns set up a feedback path from the central buffers (called mailboxes) to an input/output port. Since everything inside the switch is pre-determined and periodic, the scheduled packet departure times can be fed back to the inputs to compute the waiting time for the next packet, so that packets depart in sequence. The communication overhead incurred is thus the transmission of the packet departure time, a constant independent of the size of the switch. The computation overhead incurred is the computation of the waiting time, which also requires only a constant number of operations.

Instead of using the two-stage construction of the load-balanced Birkhoff-von Neumann switch, we will consider the folded version shown in Figure 1.2. In the folded version, there is only one switch fabric. The central buffers of the load-balanced Birkhoff-von Neumann switch are distributed among the linecards. As such, packets arriving at a linecard go through the switch fabric twice: once for load balancing and once for switching.

Figure 1.2. The folded version of the mailbox switch.

Switch Fabric

There are two crossbar switch fabrics in the load-balanced Birkhoff-von Neumann switch (one for the folded version). Now we describe how the connection patterns of these two crossbar switch fabrics are set up. In every time slot, both crossbar switches in Figure 1.1 have the same connection pattern. During the t-th time slot, input port i is connected to output port j if

    i + j = (t mod N) + 1   or   i + j = (t mod N) + N + 1,   1 <= i, j <= N.        (1)

In particular, at t=1, we have input port 1 connected to output port 1, input port 2 connected to output port N, input port 3 connected to output port N-1, ..., and input port N connected to output port 2. Clearly, such connection patterns are periodic with period N. Moreover, each input port is connected to each of the N output ports exactly once in every N time slots. Specifically, input port i is connected to output port 1 at time i, output port 2 at time i+1, ..., and output port N at time i+N-1. Also, we note from (1) that such connection patterns are symmetric, i.e., input port i and output port j are connected if and only if input port j and output port i are connected. As such, we call a switch fabric that implements the connection patterns in (1) a symmetric Time Division Multiplexing (TDM) switch.
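For concreteness, the following sketch (our illustrative Python, with ports numbered 1 to N as in the text) generates the connection pattern of (1) and checks the properties just stated:

    def connection_pattern(t, N):
        """Map each input i to the output j satisfying (1),
        i.e., i + j = (t mod N) + 1 modulo N, with ports in 1..N."""
        return {i: (t - i) % N + 1 for i in range(1, N + 1)}

    N = 8
    for t in range(1, N + 1):
        pi = connection_pattern(t, N)
        assert all(pi[pi[i]] == i for i in pi)        # symmetric
        assert pi == connection_pattern(t + N, N)     # periodic with period N
    # each input meets every output exactly once per period of N slots
    assert all({connection_pattern(t, N)[i] for t in range(1, N + 1)}
               == set(range(1, N + 1)) for i in range(1, N + 1))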

Figure 1.3. A two-stage construction of an N×N symmetric TDM switch fabric.

Even though an N×N symmetric TDM switch can be implemented by an N×N crossbar switch, one of our key innovations shows that an N×N symmetric TDM switch can be recursively constructed with O(N log N) complexity. In Figure 1.3, we show a two-stage construction of an N×N symmetric TDM switch (with N=pq). The first stage consists of p q×q symmetric TDM switches (indexed 1, 2, ..., p) and the second stage consists of q p×p symmetric TDM switches (indexed 1, 2, ..., q). These two stages are connected by the perfect shuffle, i.e., the l-th output of the k-th switch at the first stage is connected to the k-th input of the l-th switch at the second stage. Also, index the N inputs and outputs from 1 to N. The N inputs of the N×N switch are connected to the inputs of the switches at the first stage by the perfect shuffle.

The symmetric TDM switches at these two stages operate at different time scales. The connection patterns of the symmetric TDM switches at the second stage change every time slot. However, the connection patterns of the symmetric TDM switches at the first stage change every frame, with each frame containing p time slots. To be specific, we define the m-th frame of the k-th switch at the first stage in Figure 1.3 to be the set of time slots {(m-1)p+k, (m-1)p+k+1, ..., mp+k-1}. Then every symmetric TDM switch at the first stage operates according to its own frames. Note that the p symmetric TDM switches at the first stage do not change their connection patterns at the same time, as the m-th frames of these switches contain different sets of time slots.
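A small helper makes the staggered frame timing explicit (an illustrative sketch; k is the index of a first-stage switch and p the frame length defined above):

    def frame_index(k, t, p):
        """Frame number m such that slot t lies in the m-th frame of the
        k-th first-stage switch, i.e., in {(m-1)p + k, ..., mp + k - 1}."""
        assert t >= k
        return (t - k) // p + 1

    # Switch k starts a new frame at slots t with t = k (mod p), so no two
    # first-stage switches reconfigure in the same time slot.
    p = 4
    for k in range(1, p + 1):
        print(k, [t for t in range(1, 13) if (t - k) % p == 0])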

In Figure 1.4, we show how one builds a 256×256 symmetric TDM switch with 2×2 switches via the recursive construction.

Figure 1.4. A 256×256 symmetric TDM switch built from 2×2 switches via the recursive construction.
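The O(N log N) complexity can be sanity-checked by counting the 2×2 elements in the recursion S(pq) = p·S(q) + q·S(p) implied by the two-stage construction (our restatement); for N = 2^n this yields (N/2)·log2 N switches, e.g., 1024 for the 256×256 fabric of Figure 1.4:

    import math

    def count_2x2(N):
        """Number of 2x2 switches in the recursive construction,
        splitting N = pq with p = 2 at every level."""
        if N == 2:
            return 1
        q = N // 2
        return 2 * count_2x2(q) + q        # p=2: two size-q fabrics + q 2x2s

    for n in (16, 64, 256):
        assert count_2x2(n) == (n // 2) * int(math.log2(n))
    print(count_2x2(256))                  # 1024 two-by-two switches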

Another method of building the symmetric TDM switch fabric is to build a 16×16 switch fabric first and then use the recursive construction discussed above to build the 256×256 switch fabric. In Figure 1.5, we show a 3D construction of the 256×256 switch fabric from 32 16×16 switch fabrics.

Figure 1.5. A 3D construction of a 256×256 symmetric TDM switch fabric.

Linecards

In Figure 1.6, we show the architecture of a linecard. It consists of the following components: PHY, MAC, NP (network processor), forwarding table, RAM, and a buffer manager chip. The PHY chip performs the physical layer function, i.e., converting the optical/electronic signals into bit streams and packets. The MAC chip performs the medium access control needed for Ethernet. Both PHY and MAC chips are commercially available for 1 Gbit/sec Ethernet and 2.5 Gbits/sec OC48. For these, we will seek help from ITRI/CCRL in implementing the standard PHY and MAC. A more challenging approach is to build PHY and MAC for OC768 that demultiplex a single fiber into 16 OC48 channels. This will require building high-speed optical components of our own, as the technology is still not available in Taiwan.

Instead of designing an ASIC for packet processing, we will use (commercially available) network processors for design flexibility. The network processor needs to look up the output port of a packet in the forwarding table, perform segmentation and reassembly, collect statistics, and detect faults.

Our focus on the linecard is the buffer manager chip. As mentioned before, every packet needs to go through the switch fabric twice. As such, there are two buffers in a linecard: the load-balancing buffer (FIFO queues) and the central buffer (mailboxes). It should be straightforward to implement FIFO queues for the load-balancing buffer. Managing the mailboxes, however, is more challenging. For an N×N switch, each mailbox contains N bins (indexed 1, ..., N), and each bin contains F cells (indexed 1, ..., F). Each cell can store exactly one (segmented) packet. Cells in the i-th bin of a mailbox are used for storing packets destined for the i-th output port of the second switch. The buffer manager chip is responsible for placing a packet into the appropriate cell so that packets depart in order (see [1.5]).
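The sketch below illustrates only the mailbox bookkeeping; the actual cell-selection rule is specified in [1.5], and the departure-time function here is a hypothetical stand-in for the pre-determined TDM schedule:

    class Mailbox:
        """N bins x F cells; bin i holds packets destined for output i."""
        def __init__(self, N, F):
            self.cells = [[None] * F for _ in range(N)]

        def place(self, pkt, dest, departs_at, last_departure):
            """Put pkt in the first free cell of bin `dest` whose scheduled
            departure is later than the flow's previous packet (a hypothetical
            rule; see [1.5] for the actual one). `departs_at(bin, cell)`
            stands in for the fixed, pre-computed TDM departure schedule."""
            for c, slot in enumerate(self.cells[dest]):
                if slot is None and departs_at(dest, c) > last_departure:
                    self.cells[dest][c] = pkt
                    return departs_at(dest, c)   # fed back to the input port
            return None                          # no feasible cell: try later

Because the departure schedule is deterministic, the returned departure time is the only feedback the input needs, matching the constant per-packet overhead described above.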

Figure 1.6. The architecture for a linecard.

2. Design and implementation of high-speed multi-service switch.

The second main objective of this sub-project is to design and implement a policy-based high-speed multi-service switch, based on a network processor hardware platform, to provide gigabit-level multi-services, such as VPN, QoS mobility management, IPv4/IPv6 translation, intrusion prevention, content filtering, on-the-fly anti-virus, etc., over LAN/MAN environments. Our approach will be based on the network processor-based service switch platform developed in the first phase of our PPAEU project.

GESS Switch Architecture

In the first phase of our PPAEU project, we developed a chassis-based 16 Gbps Gigabit Ethernet Service Switch (GESS), as shown in Figure 1.7. The GESS comprises eight gigabit hardware modules and a switching fabric module. Each hot-swappable module carries two network processors to process the data path and one MIPS CPU running the VxWorks RTOS to handle the control path, as shown in Figure 1.8. We have developed a QoS module to provide QoS services, such as bandwidth management, for different applications on the gigabit network, such as VoIP, video conferencing, and video streaming.

Figure 1.7. The chassis-based Gigabit Ethernet Service Switch.

Figure 1.8. The gigabit service module.

 

GEMSS

The Gigabit Ethernet Multi-Service Switch (GEMSS) hardware architecture will be based on that of the GESS. To provide policy-based gigabit multi-services in real time, we need to develop new technologies for processing packets on the fly, such as VPN, QoS mobility management, IPv4/IPv6 translation, intrusion prevention, content filtering, and on-the-fly anti-virus. To implement these services on the GEMSS, we also need to develop a new service hardware module with a new layer-7 processing engine (FPGA or ASIC). The GEMSS will first be deployed on the gigabit campus networks of NTHU and NCTU to provide multi-services, as shown in Figure 1.9.

For QoS mobility management, it is critical to provide QoS services while mobile devices, such as PDAs, notebooks, and 3G handsets, roam over the campus network. Typically, it is not easy to maintain TCP-level (or upper-layer) QoS during such roaming even when network (IP)-level handoff mechanisms are implemented. With the QoS mobility management technology, we can efficiently migrate TCP QoS among GEMSSes to provide QoS mobility services over the gigabit campus network.

For the IPv4/IPv6 translation service, we will design and implement an IPv4/IPv6 translator to connect the IPv4 backbone network and native IPv6 networks. As IPv6 is one of the emerging technologies for the next-generation Internet, and some IPv6 test networks have already been developed and deployed, it is desirable to provide this service within the campus network. This service is also important for the deployment of IPv6 wireless LANs, such as ad hoc networks or large-scale sensor networks.
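To give a flavor of the data-path work such a translator performs, here is a minimal sketch of prefix-based address mapping (purely illustrative: the 64:ff9b::/96 prefix and this particular mapping are our assumptions, not the project's design):

    import ipaddress

    PREFIX = ipaddress.IPv6Network("64:ff9b::/96")   # assumed translator prefix

    def v4_to_v6(v4):
        """Embed an IPv4 address in the low 32 bits of the IPv6 prefix."""
        return ipaddress.IPv6Address(int(PREFIX.network_address)
                                     + int(ipaddress.IPv4Address(v4)))

    def v6_to_v4(v6):
        """Recover the embedded IPv4 address."""
        return ipaddress.IPv4Address(int(ipaddress.IPv6Address(v6)) & 0xFFFFFFFF)

    assert str(v4_to_v6("192.0.2.1")) == "64:ff9b::c000:201"
    assert str(v6_to_v4("64:ff9b::c000:201")) == "192.0.2.1"

A full translator must of course also rewrite headers, recompute checksums, and handle fragmentation, which is where the hardware platform comes in.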

The intrusion prevention service protects the campus network from emerging network attacks, such as DoS/DDoS and signature-based/protocol-anomaly attacks. To do this in a gigabit environment, we need to develop new string matching algorithms that provide fast and efficient prevention of network intrusions. We also need to develop string matching engines (FPGA or ASIC) as well as a hardware module to achieve this goal.
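As a software baseline for such engines, the classical Aho-Corasick automaton matches many signatures in a single pass over the input; the sketch below is our illustration, not the algorithm to be developed in this sub-project:

    from collections import deque

    def build_automaton(patterns):
        """Aho-Corasick automaton: goto/fail tables and per-state outputs."""
        goto, fail, out = [{}], [0], [set()]
        for pat in patterns:
            s = 0
            for ch in pat:
                if ch not in goto[s]:
                    goto.append({}); fail.append(0); out.append(set())
                    goto[s][ch] = len(goto) - 1
                s = goto[s][ch]
            out[s].add(pat)
        q = deque(goto[0].values())            # depth-1 states fail to the root
        while q:
            r = q.popleft()
            for ch, s in goto[r].items():
                q.append(s)
                f = fail[r]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[s] = goto[f].get(ch, 0)
                out[s] |= out[fail[s]]
        return goto, fail, out

    def scan(text, automaton):
        """Report (position, pattern) for every signature occurrence."""
        goto, fail, out = automaton
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            hits += [(i - len(p) + 1, p) for p in out[s]]
        return hits

    # [(1, 'she'), (2, 'he'), (2, 'hers')] (order of simultaneous hits may vary)
    print(scan("ushers", build_automaton(["he", "she", "hers"])))

The per-character state transition is exactly the kind of operation a hardware matching engine parallelizes to reach gigabit rates.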

For the on-the-fly anti-virus service, the first challenge is how to detect viruses at the packet level instead of at the file level, as traditional schemes do. Traditional file-based anti-virus must first receive the complete file and then scan it to detect viruses. This is not practical in the coming broadband multimedia environment, as more and more huge files will be delivered over the network; the traditional anti-virus check wastes both bandwidth and time. To design on-the-fly anti-virus, we first need to develop virus DNA to reduce the size of virus patterns, and we also need to develop new packet processing technology to handle out-of-order packets, as well as fast and efficient DNA matching algorithms.
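A minimal sketch of the packet-level idea (our own simplified model, reusing build_automaton from the sketch in the previous subsection; real TCP reassembly, overlapping segments, and virus-DNA signatures are far more involved):

    class StreamScanner:
        """Feed bytes through the matching automaton as soon as they are
        in order, instead of waiting for the complete file."""
        def __init__(self, automaton):
            self.goto, self.fail, self.out = automaton
            self.state, self.next_seq, self.pending = 0, 0, {}

        def on_segment(self, seq, data):
            """Accept a (possibly out-of-order) segment; returns signatures
            matched so far. Assumes non-overlapping segments for brevity."""
            self.pending[seq] = data
            hits = []
            while self.next_seq in self.pending:       # the gap has closed
                for ch in self.pending.pop(self.next_seq):
                    while self.state and ch not in self.goto[self.state]:
                        self.state = self.fail[self.state]
                    self.state = self.goto[self.state].get(ch, 0)
                    hits += list(self.out[self.state])
                    self.next_seq += 1
            return hits

    s = StreamScanner(build_automaton(["evil"]))
    print(s.on_segment(2, "vil!"))   # out of order: buffered, no match yet: []
    print(s.on_segment(0, "ee"))     # gap closes, "eevil!" is scanned: ['evil']

Only the automaton state and the out-of-order segments need to be kept per flow, so memory stays bounded even for very large files.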

Figure 1.9. The deployment of the GEMSS on the gigabit campus network.

 

REFERENCES

[1.1] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, “High speed switch scheduling for local area networks,” ACM Trans. on Computer Systems, Vol. 11, 1993, pp. 319-352.

[1.2] C. S. Chang, D. S. Lee, and Y. S. Jou, “Load balanced Birkhoff-von Neumann switches, part I: one-stage buffering,” Computer Communications, Vol. 25, 2002, pp. 611-622.

[1.3] C. S. Chang, D. S. Lee, and C. M. Lien, “Load balanced Birkhoff-von Neumann switches, part II: multi-stage buffering,” Computer Communications, Vol. 25, 2002, pp. 623-634.

[1.4] C. S. Chang, D. S. Lee, and C. Y. Yue, “Providing guaranteed rate services in the load balanced Birkhoff-von Neumann switches,” Proceedings of IEEE INFOCOM, 2003.

[1.5] C. S. Chang, D. S. Lee and Y. J. Shih, “Mailbox switch: a scalable two-stage switch architecture for conflict resolution of ordered packets,” preprint, 2004.

[1.6] J. Dai and B. Prabhakar, “The throughput of data switches with and without speedup,” Proceedings of IEEE INFOCOM, Tel Aviv, Israel, March 2000, pp. 556-564.

[1.7] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, “Input versus output queueing on a space-division packet switch,” IEEE Trans. Commun., Vol. COM-35, No. 12, Dec. 1987.

[1.8] Y. Li, S. Panwar and H. J. Chao, “On the performance of a dual round-robin switch,” Proceedings of IEEE INFOCOM, 2001, pp. 1688-1697.

[1.9] N. McKeown, “Scheduling algorithms for input-queued cell switches,” Ph.D. Thesis, University of California at Berkeley, 1995.

[1.10] N. McKeown, V. Anantharam and J. Walrand, “Achieving 100% throughput in an input-queued switch,” Proceedings of IEEE INFOCOM, 1996, pp. 296-302.

[1.11] A. Mekkittikul and N. McKeown, “A practical scheduling algorithm to achieve 100% throughput in input-queued switches,” Proceedings of IEEE INFOCOM, 1998.

[1.12] N. F. Huang and S. M. Zhao, “A Novel IP Routing Lookup Scheme and Hardware Architecture for Multi-Gigabit Switch Routers,” IEEE Journal on Selected Areas in Communications (IEEE JSAC), Vol. 17, No. 6, June 1999, pp. 1093-1104.

[1.13] N. F. Huang and T. Liu, “A Novel URL Lookup Engine for Content-Aware Multi-Gigabit Switches,” submitted for publication.

[1.14] N. F. Huang and T. Liu, “A Fast String Matching Algorithm for Network Processor based Intrusion Detection System,” submitted for publication.

