Hypothetical product change aimed at increasing listening engagement

Overview

We evaluate a hypothetical product change aimed at increasing user engagement on the platform.

We conduct an A/B test to determine whether the treatment has a meaningful impact on listening time.

  • Primary metric: Listening time per user

Exploratory Data Analysis

Dataset Overview

  • Contains 8,000 users and 13 features

  • Time window is not explicitly documented in the source dataset

  • Presence of weekly aggregated variables suggests a short-term behavioral snapshot, but exact duration is unknown

Table 1: Dataset features overview
Column                 Description                                         Type
user_id                Unique user identifier                              ID
gender                 User gender                                         Categorical
age                    Age of user                                         Numeric
country                Country of user                                     Categorical
subscription_type      Subscription plan (Free, Premium, Family, Student)  Categorical
listening_time         Total listening time                                Numeric
songs_played_per_day   Avg songs played per day                            Numeric
skip_rate              Fraction of songs skipped                           Numeric
device_type            Device used (mobile/desktop/etc.)                   Categorical
ads_listened_per_week  Ads listened per week (Free users)                  Numeric
offline_listening      Whether user uses offline mode                      Binary
is_churned             Whether user churned                                Binary
treatment_group        A/B test assignment (control, treatment)            Categorical

Experiment Design

We simulate an A/B test, evaluating the impact of an improved recommendation system on user engagement and churn.

  • Stratified randomization by subscription type
  • 70/30 treatment/control split within each subscription type
  • Ensures balanced representation across subscription types
  • Allows us to estimate treatment effects both overall and within subscription segments
  • Control: existing recommendation system
  • Treatment: improved personalization experience
  • Unit of randomization: user-level
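The design above can be sketched in a few lines. This is a minimal illustration of stratified user-level assignment, not the code used for the report: the function name `stratified_assign`, the seed, and the toy user list are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_assign(users, treatment_share=0.7, seed=42):
    """Assign users to 'treatment' or 'control' within each subscription type.

    `users` is a list of (user_id, subscription_type) pairs.
    """
    rng = random.Random(seed)

    # Group user ids by stratum (subscription type).
    strata = defaultdict(list)
    for user_id, subscription_type in users:
        strata[subscription_type].append(user_id)

    assignment = {}
    for subscription_type in sorted(strata):
        ids = strata[subscription_type]
        rng.shuffle(ids)  # random order within the stratum
        n_treatment = round(len(ids) * treatment_share)
        for i, user_id in enumerate(ids):
            assignment[user_id] = "treatment" if i < n_treatment else "control"
    return assignment


# Hypothetical toy input: 4 subscription types with 100 users each.
plans = ["Free", "Premium", "Family", "Student"]
users = [(f"u{i}", plans[i % 4]) for i in range(400)]
assignment = stratified_assign(users)
```

Because the split is applied inside each stratum, every subscription type ends up with the same 70/30 ratio, which is what makes the per-segment comparisons later in the report possible.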

Randomization Check

We compare pre-treatment characteristics across control and treatment groups to validate the random assignment.

Looking at the most important characteristics, subscription type and device type, we see no meaningful differences between control and treatment groups.



Table 2: Subscription Type Distribution by Treatment Group
subscription_type  Control  Treatment
Family             23.6%    24.4%
Free               24.9%    25.9%
Premium            26.7%    25.8%
Student            24.7%    23.9%

Table 3: Device Type Distribution by Treatment Group
device_type  Control  Treatment
Desktop      34.7%    34.7%
Mobile       32.6%    32.3%
Web          32.7%    33.0%
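One simple way to quantify "no meaningful difference" is the largest gap in category share between the two groups. The sketch below applies that idea to the percentages reported in Tables 2 and 3; the helper name `largest_gap` is hypothetical, and this is a descriptive check rather than a formal test.

```python
# (Control %, Treatment %) per category, copied from Tables 2 and 3.
subscription = {
    "Family": (23.6, 24.4),
    "Free": (24.9, 25.9),
    "Premium": (26.7, 25.8),
    "Student": (24.7, 23.9),
}
device = {
    "Desktop": (34.7, 34.7),
    "Mobile": (32.6, 32.3),
    "Web": (32.7, 33.0),
}


def largest_gap(table):
    """Return (category, gap) for the largest absolute gap, in percentage points."""
    category = max(table, key=lambda c: abs(table[c][1] - table[c][0]))
    control, treatment = table[category]
    return category, round(abs(treatment - control), 1)


subscription_gap = largest_gap(subscription)
device_gap = largest_gap(device)
```

On these figures the largest gap is about one percentage point for subscription type and well under one point for device type.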

Impact on Listening Time

Key Results

  • Mean listening time was slightly lower in the treatment group (-2.1%)
  • The distributions of listening time are very similar between the groups

Table 4: Impact on Listening Time
Metric Control Treatment Relative Lift
Listening Time 155.1 151.8 -2.1%
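The relative lift column is the difference in means divided by the control mean. A minimal sketch, using the two means from the table (the function name `relative_lift` is an illustrative choice):

```python
def relative_lift(control, treatment):
    """Relative lift of treatment over control, as a fraction of control."""
    if control == 0:
        raise ValueError("control mean must be non-zero")
    return (treatment - control) / control


lift = relative_lift(155.1, 151.8)
print(f"{lift:.1%}")  # -2.1%
```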



Statistical Significance (Welch’s t-test)

We test whether the observed difference in listening time between groups is statistically significant.

Why Welch’s t-test?

  • Treatment and control groups are unequal in size (70/30 split)
  • Variance between groups may differ
  • Welch’s t-test does not assume equal variances
  • Robust choice for real-world A/B tests
  • Two-sided test is used as we do not assume a directional effect

Hypotheses

  • Null hypothesis: no difference in mean listening time
  • Alternative hypothesis: difference exists

Result

  • The test is not statistically significant (p = 0.11, above the 0.05 threshold)
  • We fail to reject the null hypothesis

Interpretation

  • No strong evidence that the treatment changes listening time
  • Observed decrease in listening time may be due to random variation



Table 5: Welch’s t-test comparing mean listening time between Control and Treatment groups
t-statistic p-value
-1.59 0.11
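For reference, Welch's test can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the report's code: the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples such as this one, and the function names and toy samples are hypothetical. The sign convention is treatment minus control, matching the negative t-statistic in the table.

```python
import math
from statistics import NormalDist, mean, variance


def normal_two_sided_p(t_stat):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


def welch_t_test(control, treatment):
    """Return (t, df, p) for Welch's unequal-variance t-test.

    t uses treatment minus control, so a negative t means treatment is lower.
    """
    n_c, n_t = len(control), len(treatment)
    # Squared standard error contributed by each group (sample variance / n).
    v_c = variance(control) / n_c
    v_t = variance(treatment) / n_t
    t_stat = (mean(treatment) - mean(control)) / math.sqrt(v_c + v_t)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v_c + v_t) ** 2 / (v_c ** 2 / (n_c - 1) + v_t ** 2 / (n_t - 1))
    return t_stat, df, normal_two_sided_p(t_stat)


# Hypothetical toy samples, not the report's data.
t_stat, df, p_value = welch_t_test([1, 2, 3, 4, 5], [2, 3, 4, 5, 6, 7])

# Sanity check against the reported statistic: t = -1.59 gives p of about 0.11.
p_reported = normal_two_sided_p(-1.59)
```

With thousands of users per group the t and normal distributions are nearly identical, which is why the approximation reproduces the reported p-value to two decimals.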

Effect Size Interpretation (Listening Time)

We estimate a small negative change in listening time for the treatment group compared to control.

Key Findings

  • Treatment shows a small decrease in listening time
  • Effect size is negative but small in magnitude
  • Difference is not statistically significant (p = 0.11)

Interpretation

  • No strong evidence of a real impact on engagement
  • Observed difference is consistent with random variation
  • Effect is too small to draw product conclusions

Summary Metrics

Table 6: Impact on Listening Time
Metric Control Treatment Relative Lift
Listening Time 155.1 151.8 -2.1%



Table 7: Welch’s t-test comparing mean listening time between Control and Treatment groups
t-statistic p-value
-1.59 0.11

Segmented Effects Based on Subscription Type (Listening Time)

We examine whether the treatment effect varies across subscription types.

Key Findings

  • We observe variation in treatment response across subscription segments
  • Overall effects remain small in magnitude (≈ -5% to +3%)
  • No consistent directional pattern across all groups
  • Small positive change for Students (+2.6%)

Subscription Type Breakdown

Table 8: Mean listening time by subscription type, with relative lift computed as (Treatment − Control) / Control
subscription_type  Control  Treatment  Relative Lift
Family             153.2    146.4      -4.4%
Premium            157.0    152.0      -3.2%
Free               156.5    151.7      -3.1%
Student            153.3    157.2      +2.6%

Segmented Effects by Device Type (Listening Time)

We examine whether the treatment effect varies across device types.

Key Findings

  • Small negative effect observed for desktop (-4.4%) and web (-2.0%)
  • No meaningful change for mobile users (+0.3%)
  • Direction is negative or flat across all device types

Interpretation

  • No device segment shows a meaningful positive response to the treatment
  • Slightly negative effects across key platforms suggest lack of product benefit
  • Results reinforce the overall neutral to negative impact observed

Device Type Breakdown

Table 9: Mean listening time by device type, with relative lift computed as (Treatment − Control) / Control
device_type  Control  Treatment  Relative Lift
Desktop      158.4    151.5      -4.4%
Web          156.5    153.4      -2.0%
Mobile       150.1    150.5      +0.3%

Guardrail Metrics (Treatment Safety Check)

We evaluate whether the treatment negatively impacts retention or engagement quality.

Key results:

  • Churn rate: slightly higher in treatment (+1.9 percentage points, a 7.5% relative increase)
  • Skip rate: no meaningful difference between groups
  • Engagement metrics remain stable

Interpretation:

  • No strong evidence of degradation in engagement quality
  • Slight increase in churn may indicate a potential risk, but is not conclusive
  • Results should be interpreted cautiously

Table 10: Guardrail metrics (Control vs Treatment with relative lift)
Metric      Control  Treatment  Relative Lift
Churn rate  25.3%    27.2%      +7.5%
Skip rate   30.1%    29.9%      -0.5%
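The report does not include a significance test for the churn difference. One common option is a pooled two-proportion z-test, sketched below. The group sizes (2,400 control, 5,600 treatment) are an assumption: the report does not state them, and the code uses 30% and 70% of 8,000 purely for illustration. The rates are the rounded values from the table, so the output is indicative only and should not be read as a result of the original analysis.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(p_control, n_control, p_treatment, n_treatment):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p_control * n_control + p_treatment * n_treatment) / (
        n_control + n_treatment
    )
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Churn rates from the table; group sizes are ASSUMED, not reported.
churn_control, churn_treatment = 0.253, 0.272
abs_diff_pp = (churn_treatment - churn_control) * 100             # percentage points
rel_lift = (churn_treatment - churn_control) / churn_control      # relative change

z, p_value = two_proportion_z_test(churn_control, 2400, churn_treatment, 5600)
```

The sketch also separates the absolute difference in percentage points from the relative lift, since the two are easy to confuse when reading the table.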

Conclusions & Recommendations

Executive Summary

  • No statistically significant effect on listening time
  • Small negative average effect observed
  • No consistent effects across subscription types or device segments
  • Slight increase in churn, but not conclusive
  • No meaningful change in skip rate

Conclusion:

The treatment does not demonstrate a meaningful impact on user engagement.

Recommendation

Based on the experiment results:

  • Primary metric: no statistically significant impact on listening time
  • Guardrails: no clear evidence of degradation in customer retention or engagement quality
  • Segment analysis: no consistent positive effects across subscription types

Decision:

  • Do not implement the current version of the treatment
  • Results do not demonstrate a meaningful improvement in user engagement
  • Recommend redesign and re-test

Limitations & Next Steps

  • Moderate variability in listening time (coefficient of variation = 0.55) reduces ability to detect small effects
  • Observed effects are small, making it difficult to distinguish signal from noise
  • Results reflect short-term behavior and may not capture longer-term engagement patterns
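To make the first limitation concrete, the smallest effect the experiment could reliably detect can be approximated with a standard two-sample power formula. The sketch below is a back-of-envelope estimate under several assumptions that the report does not confirm: group sizes of 2,400 and 5,600, a standard deviation taken as 0.55 times the control mean, a 5% two-sided significance level, and 80% power.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(sd, n_control, n_treatment, alpha=0.05, power=0.80):
    """Approximate MDE (absolute difference in means) for a two-sided z-test."""
    z = NormalDist()
    se = sd * math.sqrt(1 / n_control + 1 / n_treatment)
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


# ASSUMPTIONS: sd approximated from the coefficient of variation (0.55) and the
# control mean (155.1); group sizes taken as 30% / 70% of 8,000 users.
control_mean = 155.1
sd = 0.55 * control_mean
mde = minimum_detectable_effect(sd, n_control=2400, n_treatment=5600)
mde_relative = mde / control_mean
```

Under these assumptions the detectable difference is larger than the observed -2.1% lift, which is in line with the non-significant result and with the suggestion to increase sample size in future tests.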

Next Steps:

  • Iterate on the treatment design to create a stronger user impact
  • Consider increasing sample size or experiment duration in future tests
  • Continue monitoring retention metrics, given the slight increase in churn

Appendix

Additional Randomization Checks

We compare pre-treatment characteristics across control and treatment groups to validate the random assignment.

Looking at two further characteristics, gender and age, we see no large differences between control and treatment groups.



Table 11: Gender Distribution by Treatment Group
gender  Control  Treatment
Female  33.4%    32.8%
Male    33.2%    34.7%
Other   33.4%    32.5%

Table 12: Age Distribution by Treatment Group
age_group  Control  Treatment
<18        6.9%     6.1%
18-25      16.3%    16.1%
26-35      21.8%    21.7%
36-50      33.9%    35.7%
50+        21.0%    20.4%