
Commit c7fe8fe

update
1 parent 7d41629 commit c7fe8fe

File tree

2 files changed

+32
-1
lines changed

public/calibration_plot.png

80.6 KB

src/components/DemoPage.jsx

Lines changed: 32 additions & 1 deletion
@@ -354,7 +354,38 @@ if __name__ == "__main__":
   margin: '0',
   fontSize: '15px'
 }}>
-placeholder placeholder placeholder...
+In our technical evaluations, we first validate GUM accuracy. We train a GUM on recent email interactions, feeding each email (metadata, attachments, links, and replies) sequentially into the GUM. N=18 participants judged the propositions generated by GUMs to be accurate and well calibrated overall: unconfident when incorrect and confident when correct. Highly confident propositions (confidence = 10) were rated 100% accurate, while all propositions on average, including those with low confidence, were fairly accurate (76.15%). Ablation studies show that all GUM components are critical for accuracy.
+
+<div style={{
+  display: 'flex',
+  justifyContent: 'center',
+  margin: '20px auto',
+  width: '30%',
+  backgroundColor: 'white',
+  padding: '2px',
+  borderRadius: '8px',
+  boxShadow: '0 2px 8px rgba(0, 0, 0, 0.2)'
+}}>
+  <img
+    src="/calibration_plot.png"
+    alt="GUM Calibration Results"
+    style={{
+      maxWidth: '100%',
+      height: 'auto',
+    }}
+  />
+</div>
+<p style={{
+  textAlign: 'center',
+  fontSize: '14px',
+  color: 'var(--color-secondary-text)',
+  marginTop: '10px'
+}}>
+  Figure: GUMs are generally well calibrated. When errors occur, GUMs are underconfident in their propositions: the model's observed accuracy lies above the perfect-calibration line. In the user modeling setting, this is ideal; it is better to underestimate confidence in propositions than to erode user trust.
+</p>
+
+We then deploy GUMBO with N=5 participants for 5 days, with the system observing the participants' screens. This longitudinal evaluation replicated our results with the underlying GUM. Additionally, participants identified a meaningful number of useful and well-executed suggestions completed by GUMBO. Two of the five participants found particularly high value in the system and asked to keep running it on their computers after the study concluded. Our evaluations also highlight limitations and boundary conditions of GUM and GUMBO, including privacy considerations and overly candid propositions. Please read our <a href="https://arxiv.org" target="_blank" rel="noopener noreferrer" style={{ color: '#ff9d9d' }}>paper</a> for more details!
+
 </p>
 
 <h3 style={{
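
The calibration claim in the added text (confident when correct, and observed accuracy above the perfect-calibration line when the model is underconfident) can be illustrated with a small sketch. This is not the authors' evaluation code. The function name `calibrationCurve`, the `{ confidence, correct }` record shape, and the mapping of a 1-10 confidence score to an expected accuracy of `confidence / 10` are all assumptions made for illustration.

```javascript
// Minimal sketch (hypothetical helper, not from the commit): group rated
// propositions by confidence level and compare the observed accuracy with
// the accuracy a perfectly calibrated model would have at that level.
// Assumption: confidence c on a 1..maxConfidence scale implies an expected
// accuracy of c / maxConfidence.
function calibrationCurve(ratings, maxConfidence = 10) {
  const buckets = new Map();
  for (const { confidence, correct } of ratings) {
    const bucket = buckets.get(confidence) || { total: 0, correct: 0 };
    bucket.total += 1;
    if (correct) bucket.correct += 1;
    buckets.set(confidence, bucket);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([confidence, { total, correct }]) => {
      const expected = confidence / maxConfidence;
      const accuracy = correct / total;
      // gap > 0: accuracy above the diagonal (underconfident)
      // gap < 0: accuracy below the diagonal (overconfident)
      return { confidence, expected, accuracy, gap: accuracy - expected, n: total };
    });
}

// Made-up ratings for illustration only; these are not study data.
const exampleRatings = [
  { confidence: 10, correct: true },
  { confidence: 10, correct: true },
  { confidence: 5, correct: true },
  { confidence: 5, correct: true },
  { confidence: 5, correct: false },
  { confidence: 5, correct: true },
];
const exampleCurve = calibrationCurve(exampleRatings);
console.log(exampleCurve);
```

Under these assumptions, a positive `gap` at a confidence level corresponds to a point above the diagonal in a calibration plot, which is the underconfident behavior the figure caption describes.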
