Monitoring network services

1 Monitoring network servicesPavle Vuletić SGA-2 JRA2T4 T...
Author: Timothy Casey
0 downloads 3 Views

1 Monitoring network servicesPavle Vuletić SGA-2 JRA2T4 Task Leader, GÉANT Project SIG-PMV workshop, Amsterdam May 17th 2017

2 Goal of our task Measure the performance of network services for each user, observe user’s experience. JRA2T4 focuses on L2 and L3 network services in multiple domains (point- to-point and multipoint): e.g. MPLS based L3VPN, L2VPN, Ethernet based services (VLANs, PB, PBB, PTT-TE), and all services created as a chain of these and other services (SFC) regardless of the way services are provisioned (e.g. SDN), and NFV. The aim is to create an adaptable network service monitoring capability (NetMon) that will not be tailored for a single specific network service or equipment vendor, but would be used for various current and future services NetMon will provide: real-time feedback to network operations personnel or users, determine whether those services are performing to spec (SLA verification), and if not, initiate an automated analysis to localise the fault, and notify the appropriate agent to take corrective action.

3 Why is per-user service monitoring important?Users’ traffic is multiplexed over providers physical links Even when all interfaces are UP and links are uncongested some users might have service issues – can we detect this before user complains? Monitor what user really gets Provide and verify per-user SLA

4 Network service Key performance metricsMEF metrics (10.3) One-way Frame Delay One-way Mean Frame Delay One-way Frame Delay Range One-way Inter-Frame Delay Variation One-way Frame Loss Ratio One-way Availability One-way Resiliency One-way Group Availability All these metrics can be obtained from a tool which monitors loss, delay jitter (e.g. owamp) or by simple timestamping and comparing Y.1540 (IP) metrics IP packet transfer delay Mean, min, max End-to-end 2-point IP packet delay variation IP packet error ratio IP packet loss ratio Spurious IP packet rate IP packet reordered rate IP packet severe loss block ratio IP packet duplicate ratio Replicated IP packet ratio Capacity metrics Capacity, transfered bits available bandwidth, section capacity, variability of capacity IP service availability

5 Network monitoring approaches (1)Passive (SNMP, reading from NE, reading from EMS) Suitable for capacity, used bandwidth and packet error metrics (read from devices) Suitable in single-domain environment No additional traffic, no (significant) interference with the other network traffic Support for fault localization Not suitable for delay/jitter/loss metrics Problems in multiple domains, Problems in multi-vendor environments Problems with services which dynamically change path (e.g. MPLS based VPNs)

6 Network monitoring approaches (2)Active (injecting special purpose network traffic) Suitable for end-to-end delay/jitter/loss metrics Suitable for monitoring in multiple domains No problems with dynamic path changing Not suitable for capacity and available bandwidth monitoring (very intrusive and not reliable results) Injected traffic might not have the same conditions as the monitored service traffic Not suitable for chained services and fault localization Can be done: From NE – there are methods only for specific network services (e.g ag Connectivity Fault Management - CFM) From dedicated external devices – OK for all services except p2p l2 services (issues with the place to inject probe traffic)

7 Network monitoring approaches (3)Out-of-band/Network visibility Became popular recently (Brocade Packet Brokers and Visibility Manager, Ixia IxVision architecture - taps and packet brokers, Accedian smart SFPs and Flow Broker Architecture,..) Allows various types of analyses (performance, security, per flow, per service instance,...) Allows all types of performance metrics for all types of services (just filter the appripriate field in the header) Enables fault localization There are virtual taps for „inside data centre“ monitoring Multiple copies of tapped traffic have to be transported to central facility – smart sampling is required if central facility is far from taps Not very suitable for WANs – How to transport tapped traffic and not create a copy of the existing network? Target use: data centres, security verification, mobile network monitoring. Privacy issues!!!

8 Multi-domain multi-vendor multi-service environmentCannot rely on passive approach (access to other domain data) Cannot rely on non-interoperable single-vendor mechanisms (Cisco IP SLA, Juniper RPM) Cannot rely on single-technology mechanisms (802.1ag CFM) Can be done with multihomed active probes (done previously with SQM), but no fault localization (two-way metrics) New approach is needed

9 Monitoring service performance with fault localizationOn which link in which service instance is a performance problem? End-to-end probing is not sufficient Hybrid approach is needed: active + capturing Monitoring on points inside the network is needed as well Monitoring zone concept Can be generalized to SFC type of services! Image taken from: Ericsson Diamond: https://pdfs.semanticscholar.org/0119/099638d68a0836d55d7de0dfc pdf

10 Hybrid approach challengesHow to detect the packets belonging to the same service instance How to detect what is really happening with the packet (lost? Duplicated? passed?) Problems: Dynamic paths Various service types L2VPN L3VPN Changing service IDs MPLS labels Two in stack Single VLAN IDs Something else D2 Loss of service! D1 P D4 B P P D3 P P P A Packet went to D2 and D3! P

11 MPLS VPNs – different flavoursOn Juniper devices 3 different types of L2VPN: Circuit Cross Connect (RSVP signalling, separate LSP per VC, one MPLS label) Martini – RFC 4447 (LDP signalling , two MPLS labels, inner label VC distinguisher) Kompella – RFC 6624 (BGP signalling, two MPLS labels, inner label VC distinguisher) L3VPN (MP BGP signalling, two MPLS labels, inner label VC distinguisher) https://mellowd.co.uk/ccie/?p=4428

12 CCC L2VPN VLAN tag is not preserved inside the networkSingle MPLS label changes at every hop (users circuit is mapped onto a separate LSP) VLAN-label mapping is dynamic No real service ID Very difficult to detect the service instance without reading network element data CE-PE PE-PE

13 Martini/Kompella VPLS L2VPNVLAN is preserved inside the encapsulated frame Inner VPLS label is a service ID on the path through the network Inner labels dynamically allocated CE-PE PE-PE

14 MPLS L3VPNs Inner MPLS label is service IDDefault (per-prefix) and per-VRF mode Inner labels dynamically allocated (transfered by MPBGP) CE-PE PE-PE

15 Ethernet technologies (802.1ah, 802.1ad, 802.1ay)Provider Backbone Bridging (PBB) or MAC-in-MAC (802.1ah) QinQ (802.1ad) PBB-TS (802.1ay) There are serviceIDs in packets Conclusion: majority of technologies have fixed service ID in packet, although there are exceptions. New technologies will have also some service ID in packets in order to be scalable Service ID in packet is allocated dynamically

16 Our architecture Hybrid approach: active probing + packet capturing3 modes of operation Mode 1: Active end-to-end probing – no fault localization Mode 2: Active end-to-end probing with probe packet capturing – with fault localization Mode 3: User traffic capturing – with fault localization User can choose the mode of operation

17 Mode 1: Active end-to-end probingAt each PE device put a Monitoring Agent (MA) MA is a small device (e.g. RPi) with linux on it, doing owamp MA is capable to monitor multiple services at the same time with the overlapping address spaces (using linux netnamespace) MA is dynamically configured and started from the central Monitoring Controller based on the Service inventory MA sends results to the Monitoring Repository Packets between MA devices run through each service instace separately

18 Mode 2: Active end-to-end probing with probe packet capturingThe same MA devices are used at the PE devices Packet capturers inside the network Packet capturers capture only owamp probe packets owamp probe packets are modified in order to transport service related data (solving the problem of service distinguishers on the path) All data from packet capturers is sent to Monitoring correlators which calculate per segment performance data Suitable for ALL network services: probes can be outside the providers network, probe packets can be tied to the service ID regardles of the service type)

19 Modifications to owampowamp protocol (RFC section 4.1.2) has the option to add the Packet Padding of variable length to the test packets. RFC specifies that "Packet Padding in OWAMP-Test SHOULD be pseudo- random". We added service and monitoring related fields into the padding Service ID monitoring correlator address

20 OWAMP packet in CCC L2VPN

21 Mode 3: Capture all or sampled user traffic (per flow, per application,...) and analyze timestamps Challenges: services like CCC VPN where there is no unique service ID path detection (is packet lost or it has just changed its path?) Compressing captured data (timestamp+serviceID+...) Detect the beginning/end of packet batch (draft-ietf-ippm-alt-mark-04) A mode to help network debugging – feeding more network data to the system in order to detect the packets belonging to the same service Plan: be ready for 100G links

22 From our GTS experimentTraffic flows from through 4 GTS PoDs in different countries (real delays)

23 Are we nuts? Maybe... But: The approach is increasingly interestingSimilar systems appeared recently: Ixia IxVision Brocade Visibility architecture Accedian FlowBrocker

24 Ixia IxVision Taps, vTaps, Packet Brokers and Analisys tools

25 Brocade visibility architecture

26 Accedian Flow Brocker ArchitectureAccedian has their own performance modules - packet capturers (smart SFPs, NIDs) which do the timestamping and filtering (packet slicing) VCX controls capturers, sets filters and gets the captures traffic Brokered flows are sent to further analysis depending on the purpose Brokered flows << 10% of the original bandwidth

27 Accedian use cases and filteringVideo over LTE Video QoE Financial Compliance and Trade flow analysis Security and Policy

28 Current status Few weeks before the prototype is ready and presentable Performance verification and fault localization of basic network services (L2 and L3 point-to-point and multipoint VPNs), per-service instance monitoring and SLA verification, GUI/dashboards for monitoring parameters CY2017 CY2018 Integration with other OSS/BSS components (TTS or alarm management systems), the use of dedicated capturing hardware for high-speed link monitoring, Monitoring chained services Prototype ready PVM v1.0 PVM v1.0 PVM v1.0 PVM v1.0 PVM v2.0 PVM v2.0 Design PVM v2.0 PVM v2.0 You are Here May 15, 2017

29 Thanks to: David Henrik Jerry Kostas Pascal Tobias

30