1 Using PSM to Diagnose Performance Bottlenecks
Stephen Vaillancourt, Fellow, PTC Technical Support
Tim Atwood, Senior Technical Consultant, PTC Enterprise Deployment Center
June 2013
2 Agenda
- Why Performance Matters
- What is PTC System Monitor (PSM)?
- Reasons to install and use PSM
- PSM Technology from dynaTrace: PurePaths
- How to start diagnosing performance problems using PSM
- How to use PSM to drill into a performance problem
- Spotting common performance problems using PSM
- How to check the KB for solutions
- How to export and send data to PTC
Takeaways: 3 reasons to install/use PSM; 2 or 3 ways to start diagnosing problems in PSM; X ways to drill down into a problem; how to gather data for further analysis.
"Waiting is frustrating, demoralizing, aggravating, annoying, time consuming, and incredibly expensive." (FedEx commercial)
3 Facts and Myths about Windchill Performance
Facts
- Performance problems are the number one cause of patch requests and escalated accounts at PTC
- An unaddressed performance problem will eventually lead to escalation and to very unhappy users and managers
- Defining an issue and obtaining diagnostic data is a major contributor to overall turnaround time when diagnosing a performance problem
Myths
- Increasingly powerful computers mean I can purchase hardware to solve my performance problems
- Only experts can diagnose Windchill performance problems
- Logs are the only source of information about past performance problems
- It’s not possible to continuously and thoroughly monitor a production system while retaining adequate performance
Turnaround time for issues: the amount of extra time needed just to define the issue and gather data. How do I know whether there really is a problem?
4 Problems with Monitoring & Tuning with Windchill
- Windchill is considered a “black box” by many administrators
- Tuning and monitoring stability is often “reactive” rather than “preventative” due to a lack of standardized procedures, monitoring and diagnostics
- Troubleshooting can be difficult because many “moving parts” are involved: application, relational database, operating system resources, storage, etc.
- Advanced troubleshooting knowledge is often required due to a lack of deployed tools
- Multiple diagnostic efforts frequently occur before the correct information is acquired
- An issue often needs to be reproduced several times in order to log it properly
- Built-in monitoring and diagnostic tooling within Windchill has improved in 10.x but is still light
Troubleshooting is often reactive rather than proactive: logging levels must be tweaked, issues reproduced, etc.
Sudden Issue -> Reproduce with Logging -> Resolve
5 Sample Customer Results Using PTC System Monitor
- Installed or in 75% of PTC’s Largest Customers
- Many have corrected previously unknown performance issues within days of installation
- Average of 4-8 hour effort reduction for performance cases versus cases where PSM was not used
- Compuware: Mean time to resolve performance issues reduced by 63% on average with dynaTrace
Results will vary from implementation to implementation; however, these value notes can serve as a general guide to the value.
6 Summary of Key Attributes of PSM
Entitlement
- Free of charge (server-side monitoring)
- Available to customers with active maintenance
Support
- Produced, packaged and maintained by PTC
- PTC Technical Support Engineers trained on PSM
- The new standard for PTC performance troubleshooting
Monitoring
- Shipped with monitoring templates for critical KPIs
- Customizable alerts ( and in-product)
- Real time dashboards for system health
Diagnostics
- Always on logging: reduction in need to reproduce
- Trace an issue through multiple Windchill layers
- Accelerates root cause analysis by PTC Support
Expandability
- Monitor multiple Windchill servers with a single instance
- Optional monitoring of non-PTC environments
- Optional “User Experience Management” (UEM)
- Can support and diagnose customizations
Summary (tweaked by atwood): Skip most of this live. 4 points to make: free, TS is ready to use it, always on, customization support.
7 How Does PTC System Monitor Work?
Packaged dynaTrace software that uses PurePath technology
[Diagram labels: Client; Web Server; Application Tier: Java or .NET; Database; Optional Instrumentation (UEM or AJAX edition); PurePath Collector; dynaTrace Client; Performance Warehouse; dynaTrace Server; Sessions Store; Exported Session; Offline Session Analysis by Tech Support or others]
This, in a nutshell, is what dynaTrace does for you: it makes that connection (show path). The red line is the transaction that is running through the system. dynaTrace tracks this with its PurePath technology.
PSM is a server-side monitor that automatically instruments the application tier (JVM or CLR), not the database or web server. There is optional instrumentation for the client: User Experience Monitoring (UEM), which can be purchased separately from dynaTrace, or a free AJAX edition that has more limited capability.
8 Using PSM to find Performance Problems
Is there a problem? Where is the problem?
Methodology depends on administrator preference and what information is available:
- Proactive (Dashboards: System Health, MS Status…): check the dashboards
- Situational (User Tracking, Queue Monitoring, Operations…): functional or location-based method based on business transactions or other filters
- Architectural (Transaction Flow, Host Monitoring…): use a tier-based (LDAP, application, DB) and process-centric approach (JVM & O/S health, Start Center quick picks)
- Responsive (Functional Health, Incidents, Logging): look for errors thrown by the code
- Baselines (Comparison, Automated and Manual): use automatic and manual options to see changes in performance
- User Experience Monitor, optional (shows network and client bottlenecks): UEM allows a focus on the larger scale to find the bottleneck (client/network/server) and then drill down
9 Monitoring: System Health
When you are proactively checking the system…
- Health Indicators: highlight for details (same approach as for Incident Alerts)
- Response Times: response times for Web Requests (BT green-yellow-orange-red); drill down into orange/red web requests
- Total Transaction Times (how busy is the system in general)
- Transaction Time (Web Request or RMI)
- Drilldown dashlets: choose a time interval; further filters (BT, host, application, agent group, etc.) can be added using dashlet properties
10 Monitoring: MS Status
- KPIs (look for unbalanced active contexts, memory usage)
- Response Times Analysis
- User Activity Breakdown (see also User Tracking dashboard)
- Web Requests (response times)
- Response Time Hotspots: average response time should usually be <200 ms
- Hotspots by API (breakdown by JDBC, JNDI, etc.) can show JDBC, JNDI, cache problems
- Tip: if you are diagnosing an issue specific to a Method Server, add a filter for only that agent in the Dashboard Properties
- Tip: if you are not sure how long balancing has been skewed, create a separate chart of the ActiveContexts KPI over several previous hours
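A minimal sketch of the two MS Status checks above, done by hand outside PSM. It assumes the per-Method-Server values have been read off the dashboard and typed in; the agent names, sample numbers, and the 2x imbalance factor are illustrative assumptions, not PSM output or PTC guidance. Only the 200 ms guideline comes from the slide.

```python
from statistics import mean

# Hypothetical KPI values read off the MS Status dashboard (not real PSM output).
active_contexts = {"MethodServer1": 42, "MethodServer2": 5, "MethodServer3": 7}
avg_response_ms = {"MethodServer1": 480.0, "MethodServer2": 150.0, "MethodServer3": 170.0}

RESPONSE_GUIDELINE_MS = 200.0  # from the slide: average should usually be <200 ms
IMBALANCE_FACTOR = 2.0         # assumption: flag agents above 2x the mean load


def slow_agents(response_ms, limit_ms=RESPONSE_GUIDELINE_MS):
    """Return agents whose average response time is at or above the guideline."""
    return sorted(agent for agent, ms in response_ms.items() if ms >= limit_ms)


def unbalanced_agents(contexts, factor=IMBALANCE_FACTOR):
    """Return agents carrying more than `factor` times the mean active contexts."""
    average = mean(contexts.values())
    return sorted(agent for agent, n in contexts.items() if n > factor * average)


print("Slow:", slow_agents(avg_response_ms))
print("Unbalanced:", unbalanced_agents(active_contexts))
```

An agent that shows up in both lists is a natural candidate for the agent filter described in the first tip.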
11 MS Status Example: Memory Affecting Response Time
Ever wonder if GC is affecting user response times? PSM will tell you.
12 User Tracking
When you know a particular user is having an issue…
- Tip: Filter using CTRL-F or just start typing a string (e.g., Smith)
- Tip: Create a stored session for that user immediately using the context menu
- Tip: For any particular action, create a stored session, then have the same user or another user try the action again and use the Comparison dashlet to verify reproducibility
- Drill down to PurePaths, database, transaction flow, web requests
Custom dashboards and BTs like User Tracking or Queue Entries can facilitate other approaches from a functional or location-based point of view (show how to filter).
13 Transaction Flow and Tier-based Diagnostics
Is there a problem with a particular node or tier in my system?
Look for: proper load balancing, response time bottlenecks, errors in a tier, too much time spent between tiers (could be a queued process or caching problem).
- Tip: Use Visualization Mode to add Transaction Response Time and Executions Per Transaction to the paths
- Tip: Try filtering by Business Transaction!
14 Host Health Overview and Process Dashboard
Analyze the environment and JVM process health
15 Functional Health and Alerts
Looking for errors without resorting to logs
- The dashboard is divided into the last ½ hour and the preceding ½ hour
- Transaction Rate Hotspots: drill down and find PurePaths or web requests; find the user name as well
- Configure alerts and adjust thresholds
16 User Experience Monitor (Option from Compuware)
Where are my users located? How is the response at each location?
17 UEM shows Client and Network Components
Where is time for user actions spent? Server, Client, Network?
Individual Page Performance
18 Example Diagnostic Scenarios
Quick Primer
User Complaint? Actions:
- Use the User Tracking Dashboard to find PurePaths, web requests and/or database calls
- Export data and open TS call
- Compare response times to other user/site/timeframe
- Address bottleneck
Memory Leak? Actions:
- Check Process Health for memory/GC trends
- Trigger Memory Dumps (3 types)
- Export data and open TS call
- Add memory sensors if needed for Selective Memory Analysis
- Adjust code or tune memory
CPU Exhausted? Actions:
- Check Host/Process Dashboards
- CPU Sampling
- Thread Dumps
- Incident Drilldowns
- Export data and open TS call
- Rewrite problem code or customization
19 What are PurePaths?
PurePaths:
- Are a way of representing a transaction tracked across the system layers, using information recorded at each tier
- Are captured with a sampling algorithm. This can result in no information for many steps in a PurePath, especially for steps which happen quickly
- Are like looking at frames in a film. Scanning and seeing the whole thing in slow motion can be challenging, but significant conclusions can be drawn
- Can require a large screen to visualize effectively
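To see why sampling can leave fast steps without any information, here is a toy model of time-based sampling. It is only an illustration of the general idea and is not dynaTrace's actual capture algorithm; the step names, durations, and the 10 ms interval are made up.

```python
def observed_steps(steps, sample_interval_ms):
    """Return the names of steps that contain at least one sample tick.

    steps: list of (name, duration_ms), executed back to back starting at t=0.
    Ticks fire at t = 0, interval, 2 * interval, ...
    """
    seen = []
    start = 0
    for name, duration in steps:
        end = start + duration
        # First tick at or after `start` (integer ceiling division).
        first_tick = -(-start // sample_interval_ms) * sample_interval_ms
        if first_tick < end:
            seen.append(name)
        start = end
    return seen


# Hypothetical transaction: two quick steps and one slow one.
steps = [("doFilter", 3), ("checkAccess", 2), ("runQuery", 40), ("render", 4)]
print(observed_steps(steps, sample_interval_ms=10))
```

In this toy run the short `checkAccess` and `render` steps fall between ticks and are never observed, while the long `runQuery` step is always caught. That is the "frames in a film" effect described above.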
20 Digging Into a Problem Using PSM: Now What?
Guiding Principle: Assume the problem is with database interaction until proven otherwise.
1) Identify the PurePath(s) of interest
2) Understand the problem type by looking at “PurePath Hotspots”
3) Drill down to the “Database” view to see the SQL related to the PurePath
4) Sort by “Executions”, “Exec Max”, “Exec Total” & “Exec Avg” to identify SQL statements causing problems
5) Develop a theory as to what the main problem(s) are:
   - Long-running SQL statement: tune the SQL, the database, or both
   - Too many executions: drill down into the PurePath, then search the knowledge base or export the file to PTC for analysis
It’s also sometimes worth checking the Diagnose Performance -> Database view to see all of the SQL which has been run, and then drilling into the PurePaths from there. The problem is that this view doesn’t always seem to display ALL of the SQL which was run. If this approach is used, I would double check the PurePaths view to make sure the most problematic things have been identified.
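The sort-and-classify steps above can be sketched offline, assuming the Database view rows have been copied out by hand. The `SqlStat` class, both thresholds, and the sample rows are illustrative assumptions and not part of PSM or dynaTrace; the 15,115 execution count simply reuses the figure from the second example later in this deck.

```python
from dataclasses import dataclass


@dataclass
class SqlStat:
    sql: str
    executions: int
    exec_total_ms: float
    exec_max_ms: float

    @property
    def exec_avg_ms(self):
        # Average time per execution, guarding against a zero count.
        return self.exec_total_ms / self.executions if self.executions else 0.0


LONG_RUNNING_MS = 1000.0    # assumption: what counts as a long-running statement
TOO_MANY_EXECUTIONS = 1000  # assumption: what counts as too many executions


def classify(stat):
    """Label a statement with the problem types named on the slide."""
    problems = []
    if stat.exec_max_ms >= LONG_RUNNING_MS or stat.exec_avg_ms >= LONG_RUNNING_MS:
        problems.append("long running")
    if stat.executions >= TOO_MANY_EXECUTIONS:
        problems.append("too many executions")
    return problems


def triage(stats):
    """Rank by total execution time and keep only statements with a problem."""
    ranked = sorted(stats, key=lambda s: s.exec_total_ms, reverse=True)
    return [(s.sql, classify(s)) for s in ranked if classify(s)]


# Hypothetical rows; the SQL text is a placeholder.
rows = [
    SqlStat("SELECT ... FROM table_a WHERE ...", 15115, 30230.0, 12.0),
    SqlStat("SELECT ... FROM table_b WHERE ...", 3, 45000.0, 20000.0),
    SqlStat("SELECT ... FROM table_c WHERE ...", 10, 50.0, 8.0),
]

for sql, problems in triage(rows):
    print(sql, "->", ", ".join(problems))
```

Ranking by total time first puts both kinds of problem near the top of the list: one slow statement and one fast statement run thousands of times can cost a similar amount overall.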
21 Step 1: Identifying the PurePath of Interest
Where to look? There are several places to start:
1) Long Business Transactions
2) PurePaths
3) User Tracking
If I know the user who was having a problem, I will start with the User Tracking Business Transaction. If I am reviewing a file looking for problems, I will often use the PurePath View to start with and then use the Web Requests 180s business transaction to identify problems.
“Splittings” are usernames.
22 Step 1 (cont’d): PurePath View, Sort by Response Time
Identify the longest running PurePaths
1) Sort by “Response time”
2) Sort by “Size”
Usability Tip 1: The Size column represents the size of the PurePath, or the amount of information it contains. Generally, small PurePaths don’t contain much of value. Ignore PurePaths with a response time of 60 seconds and a size of less than 30; they contain no useful data.
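A minimal sketch of the same shortlist logic applied to rows copied out of the PurePath view. The `PurePath` class, the URIs, and the numbers are hypothetical. The sketch also assumes the "ignore" rule means both conditions together (about 60 seconds and size under 30); the slide could be read either way.

```python
from dataclasses import dataclass


@dataclass
class PurePath:
    name: str
    response_time_s: float
    size: int


def is_noise(path):
    # Assumption: drop a PurePath only when BOTH hold, i.e. it ran for
    # about 60 seconds and recorded fewer than 30 nodes.
    return abs(path.response_time_s - 60.0) < 0.5 and path.size < 30


def shortlist(paths):
    """Drop the noise, then sort by response time and size, largest first."""
    kept = [p for p in paths if not is_noise(p)]
    return sorted(kept, key=lambda p: (p.response_time_s, p.size), reverse=True)


# Hypothetical rows copied from the PurePath view.
paths = [
    PurePath("/Windchill/servlet/A", 60.0, 12),
    PurePath("/Windchill/servlet/B", 184.2, 5400),
    PurePath("/Windchill/servlet/C", 35.0, 900),
    PurePath("/Windchill/servlet/D", 60.0, 2500),
]

for p in shortlist(paths):
    print(p.name, p.response_time_s, p.size)
```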
23 Step 2: Understanding the Problem Type
Examine the “Hotspots View”
1) Two or three big problems
2) Many fast operations combined cause problems
3) One big problem
Action: Drill down into DB view
24 Step 3: Sort to identify problem SQL statements
Click on column headings to sort
The circled values are the ones that represent problems, and the corresponding SQL should be investigated further.
25 Step 2 (cont’d): Select SQL -> Drill Down to PurePaths
OR: CTRL-F to filter with part of the SQL statement
1) Drill Down -> PurePath
2) DB View -> PurePath View
In this view the PurePath was accessed by drilling down from a database statement. A common alternative is to copy a unique part of the SQL statement, often the ‘where’ and ‘from’ clauses, and use it to filter in the PurePath view. This generally works better than drilling down. Note that it is also possible to select more than one SQL statement before drilling down.
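The copy-a-fragment-and-filter technique can also be applied to SQL text that has been copied out of PSM. The sketch below is an offline analogue, not a description of how the CTRL-F filter works internally; the dictionary layout, PurePath names, and SQL strings are hypothetical.

```python
def normalize(sql):
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return " ".join(sql.split()).lower()


def filter_by_fragment(purepaths, fragment):
    """Return names of PurePaths whose SQL contains the given fragment.

    purepaths: dict mapping a PurePath name to the list of SQL strings it ran.
    """
    needle = normalize(fragment)
    return [
        name
        for name, statements in purepaths.items()
        if any(needle in normalize(sql) for sql in statements)
    ]


# Hypothetical data: PurePath name -> SQL statements copied from the DB view.
purepaths = {
    "request-1": ["SELECT a, b\n  FROM table_a\n  WHERE a.id = ?"],
    "request-2": ["SELECT c FROM table_b WHERE c.name = ?"],
    "request-3": ["select a, b from table_a where a.id = ?", "SELECT 1 FROM dual"],
}

print(filter_by_fragment(purepaths, "FROM table_a WHERE a.id"))
```

Using the ‘from’ and ‘where’ clauses as the fragment, as the slide suggests, tends to be more selective than the column list.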
26 Usability Tip 2: Adjust PurePath Display for Readability
Several adjustments can make it easier to understand:
- Shrink upper pane
- “Show All Nodes” is critical to select!
- Move “Method” column
Result: this should be easier to read
27 First Example: One big problem in the PurePath
Size of bar in “PurePath Hotspots” indicates magnitude of the problem
One Big Problem -> Select HotSpot
Drilling down into the Database View is optional but is generally a good idea. I almost always check the DB view, and maybe the method hotspots view, to make sure I understand what is going on.
28 Second Example: Many small problems
One SQL statement executed too many times: 15,115 executions!
SQL from DB view -> Drill Down -> PurePaths
Result: the PurePath shows exactly where in the code the problem is originating from
29 Usability Tip 3: Use Find and Filter functionality
Use “CTRL f” from anywhere to either Filter or Search
Filter by problem SQL statement
This is my favorite way to reduce the size & complexity of the PurePath view and to better understand what's going on
30 How to Check the KB for Solutions
Instructions given to TSEs on what to include in an article:
- PurePath name or URI
- Screenshot of PurePath Tree view showing stack
- Also include a text version of the stack so that it’s searchable. How to generate a text version of the stack? Select the lines needed in the PurePath Tree view, copy and paste into Excel, then paste into the article
- Other views needed to identify root cause
See article CS for an example
If the URI is too generic (e.g. /Windchill/servlet/SimpleTaskDispatcher), then also include the web request query. You can get this from the PurePath Tree view: right-click the doFilter() node (it should be the top node) and select Details. In the example shown, this was an add to workspace operation (action_id=Add). Make sure to filter out any customer-specific data.
31 Sending PSM Data to Technical Support
Right-click -> Export from many places
Time and Date Filter
1) Unsure of what happened, but know about when: use Export Session
2) Have identified a specific problem: export from Business Transaction
3) Export from PurePath View
When in doubt, export more data than you think is necessary. It’s better to have more and filter some of it out than to not have enough and be unable to understand the problems.
32 Takeaways
Reasons to install and use PSM
- Easier to proactively monitor and diagnose system (User Tracking!)
- Faster turnaround times for performance issues
- Greater transparency into Windchill can show issues you’ve been ignoring
How to start diagnosing performance problems using PSM
- Dashboards make it easier to spot issues
- Drilldowns make it easy to do deeper diagnostics
- Filters make it easier to make sense of the “noise”
How to use PSM to drill into a performance problem
Spotting common performance problems using PSM
How to check the KB for solutions
How to export and send data to PTC
33 Questions?
34 Delivering Product and Service Advantage
Our goal is to give customers a Product and Service Advantage
Our mission is to provide technology solutions that transform how products are created and serviced
- Technology solution: We are a technology/software company; however, we are not selling shrink-wrap software. We provide the services and know-how to solve customers’ problems with the technology.
- Transform: Speaks to process change. We are in the business of changing customers’ processes.
- Create and service products: Both the upfront knowledge work to create products and service offerings, along with the manufacturing and servicing of those products.
35 Sometimes Even PSM is not enough…
Why is TS asking for more data?
- PSM should cover >75% of performance cases, but not all of them
- Because of overhead concerns, dynaTrace is not configured to capture everything about every transaction:
  - Sampling of information in PurePaths rather than end-to-end instrumentation of the codebase
  - No bind variables for SQL statements, though this can be turned on if needed
  - Not all logging information is passed from Windchill into PSM (fixes needed)
- Data may age out of the session cache
- An exported session may contain incomplete PurePaths or the wrong data
- Components of the Windchill system like cadworkers are not instrumented yet
- Client-side capture of web browser and CAD software actions is sometimes needed
- Windchill Profiler is occasionally needed for certain issues (high overhead)
- Adjustments to PSM System Profile sensors are occasionally needed