GeneralUserModels
diff --git a/public/calibration_plot.png b/public/calibration_plot.png
Binary file, 80.6 KB
diff --git a/src/components/DemoPage.jsx b/src/components/DemoPage.jsx
Lines changed: 32 additions & 1 deletion
@@ -354,7 +354,38 @@ if __name__ == "__main__":
           margin: '0',
           fontSize: '15px'
         }}>
-          placeholder placeholder placeholder...
+          In our technical evaluations, we first validate GUM accuracy. We train the GUM on recent email interactions, feeding each email (metadata, attachments, links, and replies) sequentially into the GUM. N=18 participants judged the propositions generated by GUMs to be accurate and well calibrated overall: unconfident when incorrect and confident when correct. Highly confident propositions (confidence = 10) were rated 100% accurate, while propositions overall, including low-confidence ones, were fairly accurate on average (76.15%). Ablation studies show that all GUM components are critical for accuracy.
+          
+          <div style={{ 
+            display: 'flex', 
+            justifyContent: 'center', 
+            margin: '20px auto',
+            width: '30%',
+            backgroundColor: 'white',
+            padding: '2px',
+            borderRadius: '8px',
+            boxShadow: '0 2px 8px rgba(0, 0, 0, 0.2)'
+          }}>
+            <img 
+              src="/calibration_plot.png" 
+              alt="GUM Calibration Results" 
+              style={{
+                maxWidth: '100%',
+                height: 'auto',
+              }}
+            />
+          </div>
+          <p style={{ 
+            textAlign: 'center',
+            fontSize: '14px',
+            color: 'var(--color-secondary-text)',
+            marginTop: '10px'
+          }}>
+            Figure: GUMs are generally well calibrated. When errors occur, GUMs are underconfident in their propositions: the model's actual accuracy lies above the perfect-calibration line. In the user modeling setting, this is ideal, because underestimating confidence in a proposition is preferable to eroding user trust.
+          </p>
+
+          We then deployed GUMBO with N=5 participants for 5 days, with the system observing the participants' screens. This longitudinal evaluation replicated the results we obtained with the underlying GUM. Additionally, participants identified a meaningful number of useful and well-executed suggestions completed by GUMBO. Two of the five participants found particularly high value in the system and asked to continue running it on their computers after the study concluded. Our evaluations also highlight limitations and boundary conditions of GUM and GUMBO, including privacy considerations and overly candid propositions. Please read our <a href="https://arxiv.org" target="_blank" rel="noopener noreferrer" style={{ color: '#ff9d9d' }}>paper</a> for more details!
+
         </p>
 
         <h3 style={{