
Splunk: Optimize Mobile App Launch with Splunk Real User Monitoring

By Cindy J. Daddario

Feb 16, 2022

One of the hardest and most rewarding parts of my job as a Senior Software Engineer in our Splunk Mobile division is ensuring that our customers’ experience meets the quality standards we have committed to.

My team and I are part of an on-call rotation committed to measuring and optimizing key Service Level Indicators (SLIs) using the Splunk Real User Monitoring (RUM) and Splunk On-Call mobile apps (iOS and Android). Two such SLIs that keep us up at night are application startup time and interaction time (what we call “Time to Ready”). These metrics measure the time it takes for our applications to be fully functional and ready for user interaction.
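To make the idea of “Time to Ready” concrete, here is a minimal sketch of how a launch sequence could be timed phase by phase. It is written in Python for brevity rather than in a mobile language, and `StartupTimer`, `phase`, and the phase names are hypothetical names chosen for illustration; they are not part of the Splunk RUM SDK.

```python
import time
from contextlib import contextmanager


class StartupTimer:
    """Records how long each named launch phase takes, in seconds."""

    def __init__(self, clock=time.monotonic):
        # A monotonic clock is not affected by wall-clock adjustments.
        self._clock = clock
        self._start = clock()
        self.phases = {}

    @contextmanager
    def phase(self, name):
        """Time the body of a `with` block and store it under `name`."""
        began = self._clock()
        try:
            yield
        finally:
            self.phases[name] = self._clock() - began

    def time_to_ready(self):
        """Seconds elapsed since the timer was created (app launch)."""
        return self._clock() - self._start


if __name__ == "__main__":
    timer = StartupTimer()
    with timer.phase("load_preferences"):   # hypothetical phase name
        time.sleep(0.01)                    # stand-in for real work
    with timer.phase("render_home_screen"): # hypothetical phase name
        time.sleep(0.01)
    print(timer.phases, timer.time_to_ready())
```

In a real app, each recorded phase would be reported as a custom event so that a slow launch can be attributed to a specific step.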

To track results and progress, we have created charts and alerts that page an on-call engineer whenever our SLIs fail to meet our standards.

Our on-call engineer is paged when the p75 of “App Startup Time” or “Time to Ready” is greater than 5 seconds. Once paged, we break down the RUM metrics by platform, application version, and OS version to determine whether new code has had a performance impact. We also have detailed information on the Session Details page for each session with a longer-than-expected app startup time or Time to Ready. With every page we receive, we either improve our SLIs a little further or add more custom events to better understand the “Ready” sequence. We hold post-incident review meetings to discuss each page and the steps taken to improve our “App Startup Time” and “Time to Ready”.
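The alert rule and the breakdown described above can be sketched in a few lines. This is an illustrative sketch only: the sample record layout and the names `p75` and `breaching_groups` are assumptions made for the example, and Splunk RUM computes these percentiles for you. The only value taken from the text is the 5-second threshold.

```python
from collections import defaultdict
from statistics import quantiles

THRESHOLD_SECONDS = 5.0  # alert threshold stated in the text


def p75(values):
    """75th percentile of a list of durations, in seconds."""
    if len(values) == 1:
        return values[0]
    # quantiles(n=4) returns the three quartile cut points;
    # index 2 is the 75th percentile.
    return quantiles(values, n=4, method="inclusive")[2]


def breaching_groups(samples, key, threshold=THRESHOLD_SECONDS):
    """Group samples by `key` and return the groups whose p75 exceeds
    the threshold, mapped to that p75 value."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample[key]].append(sample["time_to_ready"])
    result = {}
    for group, durations in grouped.items():
        value = p75(durations)
        if value > threshold:
            result[group] = value
    return result


if __name__ == "__main__":
    # Hypothetical samples; real data would come from RUM sessions.
    samples = [
        {"platform": "android", "time_to_ready": 6.2},
        {"platform": "android", "time_to_ready": 7.9},
        {"platform": "ios", "time_to_ready": 1.4},
        {"platform": "ios", "time_to_ready": 2.1},
    ]
    print(breaching_groups(samples, key="platform"))
```

Calling `breaching_groups` again with a different `key`, such as an app version or OS version field, gives the same breakdown along another dimension.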

The 3:16 page

Shortly after we signed one of our biggest customers to date, our on-call engineers were being paged and woken up in the middle of the night up to twice a week. Looking at the data in Splunk RUM, we learned that the “o11y_fetch_and_store_dashboards” time was extremely long for this new customer.

Waiting on the backend

During the incident, we found that the API response time for retrieving user preferences was extremely high (5.8 seconds). For large datasets we expected to see this API call several times, because large result sets should be paginated, but we saw only one call.

When we connected the Splunk RUM trace to Splunk APM, we found that the main query was suboptimal and was not paginating the results as we expected. By implementing a fix for the affected API, we immediately reduced Time to Ready by 10%.
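As an illustration of what a paginated client looks like, the sketch below requests one bounded page at a time and follows a cursor until the backend reports no more data. The cursor scheme, `fetch_all`, and the in-memory `make_fake_backend` are assumptions for the example; the actual API and its fix are not described in this post.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every item by requesting one bounded page at a time.

    `fetch_page(cursor, page_size)` must return `(items, next_cursor)`,
    with `next_cursor` set to None on the last page.
    """
    items = []
    cursor = None
    while True:
        page, cursor = fetch_page(cursor, page_size)
        items.extend(page)
        if cursor is None:
            return items


def make_fake_backend(records):
    """In-memory stand-in for a paginated API (hypothetical)."""
    calls = []  # start offset of every request, for inspection

    def fetch_page(cursor, page_size):
        start = cursor or 0
        calls.append(start)
        end = start + page_size
        next_cursor = end if end < len(records) else None
        return records[start:end], next_cursor

    return fetch_page, calls


if __name__ == "__main__":
    fetch_page, calls = make_fake_backend(list(range(250)))
    items = fetch_all(fetch_page, page_size=100)
    print(len(items), "items in", len(calls), "requests")
```

With pagination in place, a trace for a large dataset shows several short calls instead of one very long one, which is the pattern we had expected to see.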

Loop the loop

The collected metrics pointed us to code that iterated through a list of contacts to extract fields. The loop took over 50% of the total Time to Ready. It turned out that the code went through the contact list three times to extract key fields, leading to a long “o11y_dashboard_list_favorite_load_time”.

The suboptimal code went unnoticed in our QA environments because the iterations were fast on short contact lists, but on our larger customers’ lists the loops were extremely slow. By optimizing the code to iterate through the contact list only once, we cut up to 5 seconds from Time to Ready for our large customers in the next release, improving Time to Ready by 33% and bringing our after-hours pages to zero.
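The shape of that change can be sketched as follows. The field names (`name`, `email`, `team`) and both function names are hypothetical, since the post does not show the original code; the point is only that three traversals of the list collapse into one while producing the same result.

```python
def extract_three_passes(contacts):
    """Before: one full traversal of the list per extracted field."""
    names = [c["name"] for c in contacts]
    emails = [c["email"] for c in contacts]
    teams = [c["team"] for c in contacts]
    return names, emails, teams


def extract_single_pass(contacts):
    """After: every field is extracted in a single traversal."""
    names, emails, teams = [], [], []
    for c in contacts:
        names.append(c["name"])
        emails.append(c["email"])
        teams.append(c["team"])
    return names, emails, teams


if __name__ == "__main__":
    # Hypothetical data; a large customer would have far more entries.
    contacts = [
        {"name": f"user{i}", "email": f"user{i}@example.com", "team": i % 5}
        for i in range(10_000)
    ]
    assert extract_three_passes(contacts) == extract_single_pass(contacts)
```

How much a single pass saves depends on how expensive each iteration is. On a short QA list the difference is invisible, which is why this kind of issue tends to surface only with production-sized data.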

Real-Time Observability + Mobile On-Call = Happier Customers

After the new release, we had an increase in engagement from the new customer, which helped us retain them for the long term. Our on-call rotations put our team of mobile developers in control of our SLIs, and adding Splunk RUM to our mobile observability stack has made it easier than ever to improve them. Getting started is easy for iOS and Android – sign up for a free trial here.

This blog was co-authored by Seerut Sidhu, Sr. Product Manager @Splunk.