Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback (2024)

Dong Won Lee1  Hae Won Park1  Yoon Kim1  Cynthia Breazeal1  Louis-Philippe Morency2
Massachusetts Institute of Technology1, Carnegie Mellon University2
dongwonl@mit.edu

Abstract

We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals. At a high level, our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the standard RLHF pipeline to improve an LLM-based dialogue agent. We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.



1 Introduction

Developing social dialogue agents that can interact and collaborate with humans over a long horizon remains a longstanding goal of artificial intelligence. Large language models (LLMs) pretrained at scale on the next-word prediction objective and subsequently aligned to human preferences via RLHF (Reinforcement Learning from Human Feedback) represent a significant step in this direction Ouyang et al. (2022), even leading to successful commercial applications.

However, existing methods for alignment usually assume that preference labels are annotated at the turn-level (i.e., after each utterance). This makes it difficult to extend this framework to cases where human preference labels are only available at the session-level, i.e., after an entire dialogue session (which could span 30 minutes or more). Insofar as we are interested in developing dialogue agents that can continually learn from session-level dialogue data “in the wild” (e.g., through in-person conversations), there is a need to develop techniques that can (1) align agents based on global rewards at the session level and (2) take into account extralinguistic multimodal signals that are pervasive in naturally-occurring conversations.

Concretely, a session-level score obtained post-conversation is a form of global explicit feedback, which provides a holistic assessment of a conversation session. Such feedback can be obtained naturally at scale by, for example, asking participants to rate how they felt about the dialog session. However, it is not possible to use such data directly as part of an RLHF pipeline, since current methods generally require local, turn-level signals for aligning an LLM-based dialogue agent to human preferences.

Moreover, in real-world settings and domains, agents are deployed in multisensory environments Benford et al. (1997) where they have access to rich multimodal signals (e.g., facial expressions during a video conversation). An ideal agent should leverage these signals as proxy rewards to improve its behavior. In dialogue, previous work treats many multimodal cues, such as body mimicry, vocal accommodation, and emotion, as implicit measures of conversation quality Louwerse et al. (2012). Hence, multimodal signals can serve as a form of local implicit feedback, presenting an opportunity to crossmodally guide the decomposition of the single global explicit (GE) post-interaction score.

In this paper, we describe a joint framework called GELI, which integrates global explicit (GE) and local implicit (LI) feedback. GELI makes it possible to align an LLM-based dialogue agent based on global rewards while simultaneously taking into account naturally-occurring multimodal signals. Our formulation trains a reward model that decomposes a single global explicit annotation score, shaped by local implicit multimodal signals, and then uses this reward model to align an LLM-based dialogue agent via RLHF. Specifically, we use GELI to learn a reward function based on the overall affect of the user (i.e., how positive the user felt at the end of the conversation) from a large-scale, long-horizon multimodal dialogue dataset Reece et al. (2023). Our local implicit multimodal signal comes from an affect classifier based on facial expression. We find that the reward function learned via GELI can be used to train a dialogue agent with improved performance across various metrics of conversational quality, including sensibleness, reusability, and specificity Lee et al. (2022).

2 Related Works


Reward Design

The design of the reward function can drastically change the performance of RL agents. Paradigms such as reward shaping have been shown to be effective at enabling RL agents to converge quickly and at improving performance (Mataric, 1994; Ng et al., 1999a; Devlin et al., 2011; Wu and Tian, 2016; Song et al., 2019). In addition, inverse RL (Ng et al., 2000; Fu et al., 2017) has proven useful for extracting rewards from human expert trajectories. Furthermore, intrinsic reward functions (Sorg et al., 2010; Zheng et al., 2018, 2020; Guo et al., 2018; Gangwani et al., 2018), a class of methods that uses the agent's own learning progress, have been shown to guide the agent's behavior by fostering self-improvement and adaptive exploration.

Temporal Credit Assignment

Temporal Credit Assignment (TCA) is a concept within the field of reinforcement learning and artificial intelligence that addresses the challenge of attributing credit to actions over time. It involves determining the extent to which past actions contributed to the current outcome, allowing an intelligent agent to understand the consequences of its decisions. One way to apply TCA to reinforcement learning is by manipulating the $\lambda$-discount factor and investigating how this affects policy learning (Petrik and Scherrer, 2008; Jiang et al., 2015). Recently, a line of work has treated TCA as return decomposition. RUDDER Arjona-Medina et al. (2019) assigns step-wise credit via the predictive difference between two consecutive states. IRCR Gangwani et al. (2020) is an instantiation of uniform reward redistribution. Randomized return decomposition (RRD) Ren et al. (2021) formulates a surrogate problem through Monte-Carlo sampling, estimating step-wise rewards via least-squares estimation.

Aligning Language Models To Human Preferences

Incorporating human preference feedback into a reward model, and subsequently optimizing a language model with an RL algorithm to output text that the reward model scores highly, has been shown to result in language models that generate outputs humans generally prefer (Ouyang et al., 2022). This process has been applied to summarization (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021), answering questions with long-form answers using text retrieved from the web (Nakano et al., 2021; Menick et al., 2022), generating engaging responses in dialogue settings (Thoppilan et al., 2022; Cohen et al., 2022), and following human instructions (Kojima et al., 2021; Suhr and Artzi, 2022; Kim et al., 2023b). However, these methods generally require collecting fine-grained annotations for each generated response to train the reward function, which is difficult to obtain at scale for long-horizon dialogue.

Utilizing Implicit Signals for Dialogue Agents

Much previous work utilizes local implicit signals found only in the text, such as the existence of a next human turn, the length of the next human turn, mean conversation length, sentiment and reaction in the next human utterance, retry rate, retention rate, or user rating Pang et al. (2023); Irvine et al. (2023). In contrast, ours is the first to (1) additionally utilize multimodal signals and (2) use global signals in conjunction with the local implicit signals, a combination that contributes significantly to the performance boost in the human evaluation.

3 Background

Language Models As Conversational Agents. We are interested in generating conversational responses with an autoregressive language model in a multi-sensory setting. We treat a conversational language model as an agent with a policy $\pi_\phi$ Liu et al. (2018); Liang et al. (2020); Wen et al. (2016); Thoppilan et al. (2022), parameterized by $\phi$. The utterance generated at turn $t$, given access to the textual dialogue history $s_t$, is defined to be the action $a_t$. More specifically, the dialogue up to turn $t-1$ is defined as $s_1, a_1, \ldots, s_{t-2}, a_{t-2}, s_{t-1} = s_{[:t-1]}$; for brevity we write $s_{[:t-1]} = s_t$. Therefore, the autoregressive LLM policy $\pi_\phi(s_t)$ takes $s_t$ as input and outputs a distribution over $a_t$.

Reinforcement Learning with Human Feedback (RLHF).

RLHF is commonly used to adapt an agent $\pi_\phi$ to be aligned with human feedback. Given a reward function that can gauge the quality of individual generated utterances, we can perform adaptation via reinforcement learning with human feedback (RLHF) Ouyang et al. (2022); Jaques et al. (2020); Stiennon et al. (2020). Specifically, for turn $t$, our reward function $r_\theta(s_t, a_t)$, parameterized by $\theta$, takes as input the context $s_t$ and the generated response $a_t$ and predicts the reward at the utterance level. It is also typical to use a KL term to penalize the RL policy from diverging from the pretrained model, resulting in the following objective,

$\max_{\phi} \; \mathbb{E}\left[r_{\theta}(s_t, a_t)\right] - \gamma D_{\mathrm{KL}}\left(\pi_{\phi}(\cdot \mid s_t) \,\|\, \pi_{\eta}(\cdot \mid s_t)\right)$   (1)

where $\pi_\eta$ is a reference model.
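
As a minimal illustration of Eq. 1, the per-utterance reward that the RL update actually optimizes can be formed by subtracting a KL penalty (estimated from token log-probabilities under the policy and the frozen reference model) from the learned reward. The sketch below uses illustrative names of our own and is not tied to any specific library's API:

```python
import torch

def kl_penalized_reward(r_theta: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_ref: torch.Tensor,
                        gamma: float = 0.1) -> torch.Tensor:
    """Combine the learned utterance-level reward with a KL penalty (Eq. 1).

    r_theta:          (batch,) reward r_theta(s_t, a_t) for each generated utterance
    logprobs_policy:  (batch, seq_len) token log-probs under the policy pi_phi
    logprobs_ref:     (batch, seq_len) token log-probs under the reference model pi_eta
    """
    # Per-sample KL estimate between policy and reference over the generated tokens.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return r_theta - gamma * kl
```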

4 Methods: GELI

The reward function $r_\theta$ in standard adaptation techniques relies on intermediate fine-grained annotations, requiring manual human annotation of each generated utterance. However, in many long-term dialogue settings there is only a single global explicit (GE) annotated reward for each session. Given a trajectory of the multi-turn dialogue $\tau$, the global explicit reward $R_{GE}(\tau)$ is a scalar reward at the end of the interaction, such as how positively the user felt about the conversation. This GE reward can be decomposed via sum decomposition (more details in Sec. 4.1) with the GE loss function $\mathcal{L}_{\text{GE}}$. A core novelty of our proposed GELI approach is that the decomposition of the GE reward is guided by Local Implicit (LI) feedback. Concretely, many dialogue applications and datasets of interest contain rich multimodal signals, which can provide intermediate signals useful for decomposing the single global explicit reward. We thus perform cross-modal distillation of these multimodal signals into the individually decomposed, text-only reward function via the LI loss function $\mathcal{L}_{\text{LI}}$ (more details in Sec. 4.2).


In practice, our reward function $r_\theta$ is optimized with a joint objective which enables (1) redistribution of the global explicit (GE) reward and (2) inclusion of local implicit (LI) reward signals as a reward-shaping function.

$\mathcal{L}_{\text{GELI}} = \lambda \mathcal{L}_{\text{GE}}(\theta) + (1-\lambda)\,\mathcal{L}_{\text{LI}}(\theta)$   (2)

In the following sections, we share more details about the global explicit decomposition and local implicit crossmodal reward shaping.

4.1 GE: Decomposing One Global Explicit Annotation

A global explicit reward is a human annotation at the end of the interaction, which can come in the form of a post-interaction score. Let $\tau$ denote the trajectory of the episode, i.e., $\tau = \langle s_0, a_0, s_1, a_1, \ldots, s_T, a_T \rangle$. This reward represents the overall reward of trajectory $\tau$, $R_{GE}(\tau)$. The agent in this episodic reinforcement learning paradigm must maximize the expected global explicit reward at the end of the conversation. One way to approximate the global explicit reward $R_{GE}(\tau)$ is by sum decomposition, i.e., by considering the sum of $r_\theta(s_t, a_t)$ across all previous states $s_t$ and newly generated actions $a_t$:

$R_{GE}(\tau) \approx \sum_{t=0}^{T-1} r_{\theta}(s_t, a_t)$   (3)

This idea of sum-based return decomposition (RD) can then be implemented via a least-squares-based approach, where the reward distribution is given by a learned reward function that decomposes the episodic reward $R_{GE}(\tau)$ in an additive way Arjona-Medina et al. (2019):

$\mathcal{L}_{\text{GE}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}}\left[\left(R_{GE}(\tau) - \sum_{t=0}^{T-1} r_{\theta}(s_t, a_t)\right)^{2}\right]$   (4)

Application to Conversational LLMs: To alleviate the computational costs arising from the long-horizon nature of conversations and from language modeling itself, we employ an alternative to the full least-squares return decomposition: Randomized Return Decomposition (RRD; Ren et al., 2021). RRD improves the scalability of least-squares-based reward redistribution by using a Monte-Carlo estimator to compute the predicted episodic return. We refer the reader to Appendix A for more details on RRD.
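
To make Eqs. 3–4 and the RRD variant concrete, the sketch below regresses the session-level score against a rescaled Monte-Carlo subsample of per-turn rewards. This is our simplified reading of RRD (it omits the variance-correction term discussed by Ren et al., 2021), and the variable names are illustrative rather than taken from the released code:

```python
import torch

def ge_loss_rrd(r_theta, states, actions, R_GE, K=32):
    """Randomized return decomposition (simplified sketch).

    Instead of summing r_theta over all T turns (Eq. 3), sample K turns and
    rescale their mean by T, giving an unbiased estimate of the episodic sum
    while keeping the per-update cost independent of the conversation length.
    r_theta(s, a) is assumed to return a scalar tensor.
    """
    T = len(states)
    idx = torch.randperm(T)[: min(K, T)]
    per_turn = torch.stack([r_theta(states[i], actions[i]) for i in idx])
    return (R_GE - T * per_turn.mean()) ** 2
```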

4.2 LI: Crossmodal Reward Shaping with Local Implicit Multimodal Signals

Reward decomposition offers a way to redistribute a single reward in an application-agnostic way. However, in natural dialogue there are rich extralinguistic signals (e.g., facial expressions, prosody) that provide an indication of how the conversation is being received. We thus propose to guide the decomposition such that it is shaped by local implicit (LI) multimodal signals. This essentially uses such signals as a form of reward shaping, which is valuable if they are known to be aligned with the final objective Ng et al. (1999b).

In our multi-sensory setting, we have access to the multimodal signals in response to the agent's actions $a_t$, which contain implicit signals correlated with the final reward. We call this multimodal state $s^{mm}_{a_t}$. If we have access to such multimodal signals, we can design a reward function $\Gamma$ which utilizes the multimodal signal $s^{mm}_{a_t}$ to determine a proxy reward. We can then formulate this setup as a form of crossmodal knowledge distillation (KD) Xue et al. (2022); Thoker and Gall (2019) for reward shaping. Therefore, we can express the local implicit reward $r_{\text{LI}}$ with a proxy label from a multimodal input.

$r_{\text{LI}}(s^{mm}_{a_t}) = \Gamma(s^{mm}_{a_t})$   (5)

$\Gamma$ denotes a score function designed from domain knowledge, which captures the relationship between the GE reward and the multimodal local implicit signals. A general formulation of the loss that induces crossmodal knowledge distillation of local implicit multimodal feedback signals into the reward function $r_\theta$, which only has access to textual dialogue states and actions $(s_t, a_t)$, is the following:

$\mathcal{L}_{\text{LI}}(\theta) = \mathbb{E}_{s_t, a_t, s^{mm}_{a_t} \sim \mathcal{D}}\left[\left(r_{\text{LI}}(s^{mm}_{a_t}) - r_{\theta}(s_t, a_t)\right)^{2}\right]$   (6)

Application to Conversational LLMs: Our GE reward indicates how positively the conversation made the other participant feel. Previous work Ruusuvuori (2012) shows that the facial affect of the listener is related to how the conversation is being perceived and to implicit conversation quality. Thus, we design the shaped reward $r_{\text{LI}}(s^{mm}_{a_t})$ to capture this intuition, utilizing implicit visual feedback from a facial affect classifier to encourage a decomposition informed by visual affective signals. Given a facial affect classifier $f$ with access to multimodal states $s^{mm}_{a_t}$ (in this case vision), which outputs the affect of the listener, we implement an indicator function that assigns a score of 1 if the facial affect of the listener is positive and 0 otherwise.

$\Gamma(s^{mm}_{a_t}) = \begin{cases} 1, & f(s^{mm}_{a_t}) = \textit{positive affect} \\ 0, & \text{otherwise} \end{cases}$   (7)

Note that this is only one of many ways to design the score function $\Gamma$; designing score functions that capture the relationship between local multimodal signals and the single global explicit reward leaves exciting research opportunities.
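
Putting Eqs. 2, 6, and 7 together, one possible training step for the reward model looks as follows. This is a hedged sketch: `affect_classifier`, the episode fields, and the mixing weight `lam` are illustrative assumptions, not the released implementation, and the full-sum GE term can be replaced by the RRD subsampling from the Sec. 4.1 sketch:

```python
import torch

def li_label(affect_classifier, mm_state) -> float:
    """Gamma in Eq. 7: 1 if the listener's facial affect is positive, else 0."""
    return 1.0 if affect_classifier(mm_state) == "positive" else 0.0

def geli_loss(r_theta, episode, affect_classifier, lam=0.5):
    """Joint objective of Eq. 2: lam * L_GE + (1 - lam) * L_LI for one episode.

    `episode` is assumed to hold aligned lists of text states, agent utterances,
    the listener's multimodal (facial) responses, and the session score R_GE.
    r_theta(s, a) is assumed to return a scalar tensor.
    """
    states, actions = episode["states"], episode["actions"]
    mm_states, R_GE = episode["mm_states"], episode["R_GE"]

    preds = torch.stack([r_theta(s, a) for s, a in zip(states, actions)])

    # GE term (Eq. 4): squared error between the session-level score and the
    # summed per-turn rewards.
    loss_ge = (R_GE - preds.sum()) ** 2

    # LI term (Eq. 6): regress each text-only reward toward the multimodal proxy label.
    targets = torch.tensor([li_label(affect_classifier, m) for m in mm_states])
    loss_li = ((targets - preds) ** 2).mean()

    return lam * loss_ge + (1 - lam) * loss_li
```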

5 Experiments

In this section, we describe our experiments to evaluate our proposed GELI framework, which trains a reward function with global explicit reward decomposition and local implicit visual feedback. All experiments are performed by (1) first training a reward function (e.g., using GELI or one of its ablation variants, GE only or LI only) and (2) then using the trained reward function in a reinforcement learning setup with PPO Schulman et al. (2017) to adapt the language model to generate better conversational responses. Due to computational resource constraints, the training of reward functions and adaptations are performed over a single run.

5.1 Dataset

Our experiments are based on the CANDOR Reece et al. (2023) dataset, due to its long-term nature (conversations are 31.3 minutes long on average) and large size (1,656 conversations, 7+ million words, 850 hours). The CANDOR dataset also includes video data, which is often not found in other face-to-face conversation datasets. CANDOR is used to train our reward function and to sample dialogue histories for the generations. We construct separate held-out sets for reward function training (~30,000 dialogue history-utterance pairs) and for updating the language model (~100,000 history-utterance pairs). We optimize for the "overall-affect" global explicit score from the post-interaction survey, which is given by the answer to the following question: "Overall during your conversation, to what extent did you feel positive feelings (e.g., good, pleasant, happy) or negative feelings (e.g., bad, unpleasant, unhappy)?"

5.2 Baseline Models

We compare GELI with multiple state-of-the-art reward decomposition methods that can decompose the single global explicit (GE) reward. For a fair comparison, we also report the performance of reward functions trained only on the local implicit (LI) multimodal rewards.

For all the methods mentioned below, we fine-tune additional linear layers on top of a small BART Lewis et al. (2019) language model previously finetuned for conversation summarization (https://huggingface.co/kabita-choudhary/finetuned-bart-for-conversation-summary). This also demonstrates that smaller language models may be powerful enough to discern patterns for desirable adaptations.
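
One plausible instantiation of such a reward model is a linear head over the pooled encoder states of the (dialogue history, candidate response) pair; the sketch below is our assumption of that architecture, using the checkpoint cited above, and is not the authors' released code:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TurnRewardModel(nn.Module):
    """Scalar reward r_theta(s_t, a_t) from a small pretrained BART encoder."""

    def __init__(self, base="kabita-choudhary/finetuned-bart-for-conversation-summary"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(base)
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.d_model, 1)

    def forward(self, history: str, response: str) -> torch.Tensor:
        inputs = self.tokenizer(history, response, return_tensors="pt",
                                truncation=True, max_length=1024)
        # Mean-pool the encoder hidden states as the turn representation.
        enc = self.backbone.encoder(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
        pooled = enc.last_hidden_state.mean(dim=1)
        return self.head(pooled).squeeze(-1)
```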

GE: (RRD) Randomized Return Decomposition Ren et al. (2021) aims to learn a proxy reward function for episodic reinforcement learning. It formulates the decomposition as a surrogate problem through Monte-Carlo sampling, enabling the extension of least-squares-based reward redistribution to long-horizon problems.

GE: (IRCR) Iterative Relative Credit Refinement Gangwani et al. (2020) is an instantiation of uniform reward redistribution. The non-parametric reward redistribution mechanism employed by IRCR sets the proxy reward for a transition to the normalized value of the associated trajectory return.

GE: (RUDDER) Return Decomposition for Delayed Rewards Arjona-Medina et al. (2019) employs a return predictor trained on trajectories, and step-wise credit assignment is determined by the predictive difference between two consecutive states. Through the LSTM warm-up technique, its training computation cost is not contingent on the task horizon T, enabling adaptability to long-horizon tasks.

LI: Visual Affect (VA): As a form of implicit feedback, we use the facial affect present in visual signals, as described in Section 4.2. The facial affect classifier is a CNN-based, image-level emotion detection model trained on AffectNet Mollahosseini et al. (2017). Predictions are captured in 2-second sliding windows.
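
For reference, a simple way to turn per-frame affect predictions into 2-second window labels is majority voting within each window; this aggregation choice (and the frame rate) is our assumption, not a description of the authors' pipeline:

```python
import numpy as np

def window_affect_labels(frame_affects, fps=30, window_s=2.0, hop_s=2.0):
    """Aggregate per-frame affect predictions into 2-second sliding windows.

    frame_affects: per-frame classifier outputs, 1 = positive affect, 0 = otherwise.
    Returns one binary label per window via majority vote.
    """
    win, hop = int(fps * window_s), int(fps * hop_s)
    frame_affects = np.asarray(frame_affects)
    labels = []
    for start in range(0, len(frame_affects) - win + 1, hop):
        labels.append(int(frame_affects[start:start + win].mean() > 0.5))
    return labels
```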

LI: Language Sentiment (LS): We also test whether the sentiment of the speaker's utterance can be used as a form of implicit feedback. We utilize an mDeBERTa He et al. (2020) pretrained sentiment classifier (https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student).

Evaluation:

For the trained reward functions, we compute $\mathcal{L}_{GE}(\theta)$, the MSE between $R_{GE}$ and the sum of all predicted rewards $r_\theta(s_t, a_t)$, as described in Eq. 4. We also calculate the difference $\Delta\hat{r}_{LI}$ of the expected predicted rewards conditioned on the local implicit multimodal reward $\Gamma(s^{mm}_{a_t})$. With our choice of score function as described in Eq. 7, this can be written as:

$\Delta\hat{r}_{LI} = \mathbb{E}\left[r_{\theta}(s_t, a_t) \mid f(s^{mm}_{a_t}) = \textit{positive affect}\right] - \mathbb{E}\left[r_{\theta}(s_t, a_t) \mid f(s^{mm}_{a_t}) \neq \textit{positive affect}\right]$   (8)

Intuitively, this is the difference in the predicted reward scores of the text-only utterance conditioned on the visual facial expression used as the local implicit feedback reward (e.g., the difference in the reward score of an utterance when the User responds with positive affect vs. negative affect). Given our choice of the score function $\Gamma$ in Eq. 7, $\Delta\hat{r}_{LI}$ should be greater than 0 if we assume that positive visual affect indicates that the associated utterance contributes positively to $R_{GE}$, i.e., to how the utterance is being received by the listener.
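
Given per-turn reward predictions and the corresponding affect labels on a held-out set, Eq. 8 reduces to a difference of conditional means; a small sketch (tensor names are ours):

```python
import torch

def delta_r_li(pred_rewards: torch.Tensor, affect_labels: torch.Tensor) -> float:
    """Difference of mean predicted reward on positive- vs. non-positive-affect turns (Eq. 8).

    pred_rewards:  (N,) r_theta(s_t, a_t) for held-out turns
    affect_labels: (N,) 1 if f(s^mm_{a_t}) = positive affect, else 0
    """
    pos = pred_rewards[affect_labels == 1]
    neg = pred_rewards[affect_labels == 0]
    return (pos.mean() - neg.mean()).item()
```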

5.3 Updating Language Models with Reinforcement Learning

We use LLAMA-2 Touvron et al. (2023) as the base model with the default prompt shown in Fig. 3. We adapt the LLAMA-2 model via reinforcement learning with human feedback, utilizing the above-mentioned reward functions trained to decompose the reward, and perform ablations to demonstrate the effectiveness of GELI. We use the TRL implementation of RLHF with PPO von Werra et al. (2020). Furthermore, we utilize LoRA Hu et al. (2021) due to computational constraints. We share our detailed hyperparameters in Appendix F.
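
A sketch of this adaptation step, written against the classic TRL `PPOTrainer` loop (the exact interface varies across TRL versions, and `reward_model` / `dialogue_history_dataset` are placeholders for the GELI-trained reward function and the held-out CANDOR histories, not objects defined here):

```python
import torch
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Assumed placeholders: `dialogue_history_dataset` yields tokenized dialogue
# histories, and `reward_model` is the trained r_theta from Section 4.
base = "meta-llama/Llama-2-7b-chat-hf"
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
policy = AutoModelForCausalLMWithValueHead.from_pretrained(base, peft_config=lora)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=8, mini_batch_size=2),
                         policy, ref_model=None, tokenizer=tokenizer,
                         dataset=dialogue_history_dataset)

for batch in ppo_trainer.dataloader:
    queries = batch["input_ids"]  # list of query tensors (dialogue histories)
    responses = ppo_trainer.generate(queries, return_prompt=False, max_new_tokens=64)
    # Score each generated utterance with the decomposed reward model.
    rewards = [reward_model(tokenizer.decode(q, skip_special_tokens=True),
                            tokenizer.decode(r, skip_special_tokens=True)).squeeze().detach()
               for q, r in zip(queries, responses)]
    ppo_trainer.step(queries, responses, rewards)  # PPO update with KL penalty to the reference
```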

Evaluation:

We run a human study based on 8 metrics commonly used in the literature to evaluate the quality of generated utterances Lee et al. (2022). We recruited a total of 300 crowd workers on Amazon Mechanical Turk. For each sample, consisting of a dialogue history and responses, workers were asked to rate which model(s) satisfied the given criterion. At the end of the survey, annotators were asked to describe which chatbot they would talk to again.

6 Results and Discussion

Table 1: Reward decomposition loss ($L_{GE}$, lower is better) and difference in predicted reward conditioned on visual affect ($\Delta\hat{r}_{LI}$, which should be > 0).

| Feedback Type | Baseline                | $L_{GE}$ ↓ | $\Delta\hat{r}_{LI} > 0$ |
|---------------|-------------------------|------------|--------------------------|
| —             | Human                   | N/A        | 0.087 ± 0.05             |
| —             | Mean                    | 245.495    | 0.000                    |
| —             | Mode                    | 289.473    | 0.000                    |
| GE            | IRCR                    | 394.041    | 0.008                    |
| GE            | RUDDER                  | 285.720    | 0.003                    |
| GE            | RRD (K = 32)            | 172.246    | 0.007                    |
| GE            | RRD (K = 160)           | 188.382    | 0.008                    |
| LI            | Visual Affect (VA)      | 1546.17    | 0.256                    |
| LI            | Language Sentiment (LS) | 825.31     | 0.010                    |
| GELI          | IRCR + VA               | 722.687    | 0.392                    |
| GELI          | RUDDER + VA             | 623.882    | 0.030                    |
| GELI          | RRD + VA (Ours)         | 176.897    | 0.063                    |

In this section, we discuss the quantitative and qualitative results of our experiments. We first describe the results for the reward decomposition training. Then, we discuss the results of the human evaluation of generations that are trained with the decomposed reward functions via reinforcement learning.

6.1 Reward Function

Reward Decomposition ($L_{GE}$):

We refer the reader to the rows corresponding to 'GE' on the left side of Table 1, which display the MSE of the reward decomposition loss, as described in Eq. 4. We find that among the three return decomposition methods, RRD performs best. We also compare the results when using only the local implicit (LI) multimodal rewards directly as rewards, and find that they perform significantly worse than the GE decomposition methods.

Table 2: Human evaluation on CANDOR Reece et al. (2023). Values are the percentage of annotators selecting each model per criterion (/100%, ↑), plus the GELI reward score of each model's generations.

| Model | Connection | Positivity | Social | Inclination | Interestingness | Reuse | Specific | Sensible | GELI Score |
|---|---|---|---|---|---|---|---|---|---|
| Human | 16.00 ± 2.83 | 16.33 ± 4.03 | 19.67 ± 1.89 | 17.33 ± 6.65 | 17.33 ± 6.55 | 17.33 ± 3.09 | 82.67 ± 7.93 | 85.33 ± 4.5 | N/A |
| LLAMA2 | 30.67 ± 8.73 | 26.67 ± 6.65 | 25.67 ± 8.38 | 26.00 ± 5.66 | 24.33 ± 7.76 | 28.0 ± 5.72 | 77.33 ± 6.18 | 80.33 ± 5.91 | 0.4929 |
| LLAMA2 + GE: RRD | 21.33 ± 6.80 | 16.33 ± 1.70 | 18.00 ± 2.16 | 17.67 ± 1.25 | 18.00 ± 2.83 | 11.33 ± 4.03 | 68.67 ± 6.34 | 69.0 ± 5.1 | 0.5072 |
| LLAMA2 + LI: LS (Language Sentiment) | 20.67 ± 7.04 | 21.00 ± 4.90 | 21.00 ± 5.72 | 18.33 ± 8.22 | 23.00 ± 3.56 | 22.0 ± 6.98 | 82.0 ± 3.74 | 89.67 ± 4.19 | 0.4852 |
| LLAMA2 + LI: VA (Visual Affect) | 22.67 ± 4.19 | 25.33 ± 5.44 | 31.33 ± 0.47 | 28.67 ± 3.4 | 19.33 ± 3.68 | 26.0 ± 0.82 | 67.67 ± 4.71 | 90.0 ± 2.16 | 0.4858 |
| LLAMA2 + GELI: RRD+VA (Ours) | 39.67 ± 7.32 | 44.33 ± 12.23 | 35.33 ± 10.87 | 37.33 ± 6.85 | 38.0 ± 10.2 | 41.67 ± 7.04 | 80.33 ± 4.5 | 80.67 ± 10.5 | 0.5419 |

Table 3: Human evaluation on SODA Kim et al. (2023a). Values are the percentage of annotators selecting each model per criterion (/100%, ↑).

| Model | Connection | Positivity | Social | Inclination | Interestingness | Reuse | Specific | Sensible |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5 (text-davinci-002) | 40.1 ± 7.56 | 43.05 ± 3.4 | 48.13 ± 9.08 | 46.05 ± 3.44 | 49.11 ± 7.69 | 44.03 ± 2.01 | 78.14 ± 9.49 | 80.07 ± 7.72 |
| LLAMA2 | 66.04 ± 4.79 | 70.0 ± 2.51 | 71.99 ± 6.28 | 67.0 ± 0.46 | 55.05 ± 8.24 | 65.99 ± 6.3 | 89.04 ± 2.65 | 89.99 ± 3.81 |
| LLAMA2 + GE: RRD | 30.98 ± 2.66 | 30.98 ± 5.04 | 34.04 ± 3.28 | 27.0 ± 7.43 | 24.98 ± 2.69 | 30.0 ± 2.51 | 43.97 ± 3.3 | 47.06 ± 4.34 |
| LLAMA2 + LI: LS | 62.0 ± 3.71 | 70.06 ± 4.52 | 75.02 ± 5.06 | 68.04 ± 3.41 | 59.0 ± 1.24 | 68.01 ± 3.72 | 86.04 ± 2.61 | 92.99 ± 1.47 |
| LLAMA2 + LI: VA | 55.02 ± 1.92 | 57.1 ± 7.21 | 63.04 ± 4.76 | 51.99 ± 0.67 | 43.97 ± 3.3 | 51.04 ± 3.08 | 76.03 ± 2.16 | 82.0 ± 2.49 |
| LLAMA2 + GELI: RRD + VA (Ours) | 71.01 ± 1.27 | 73.98 ± 1.76 | 76.98 ± 3.01 | 71.99 ± 1.65 | 66.97 ± 6.69 | 70.0 ± 2.51 | 90.02 ± 7.53 | 88.06 ± 4.73 |

Predicted Reward Conditioned on Visual Affect ($\Delta\hat{r}_{LI}$):

On the right side of Table 1, we display the difference in the expected predicted reward conditioned on the local implicit multimodal feedback, $\Delta\hat{r}_{LI}$. In our setting, this is the difference between the predicted reward when the visual affect is positive and when it is not.

To verify our intuition that visual feedback is correlated with actual perceived conversational quality, we ran a human study (displayed in the first row of Table 1), where we showed annotators only the textual dialogue history and the speaker's next utterance. They were asked to rate whether the speaker's next response would induce a positive or non-positive feeling in the listener. We averaged their annotations conditioned on non-positive and positive affect samples and found a statistically significant difference, indicating that the visual feedback is correlated with people's perception of conversation quality.

We find that the GE decomposition methods, trained without any LI feedback, are unable to discern between positive and non-positive facial affect, as indicated by $\Delta\hat{r}_{LI}$ values close to zero. The LI baseline using only language sentiment is, unsurprisingly, unable to do so either. In contrast, the LI baseline with visual responses is able to recognize differences between utterances that induce positive and negative affect.

GELI: Combining Global Explicit and Local Implicit Feedback

We refer the readers to the bottom of Table 1. The results are shown for the reward decomposition and visual feedback for the reward function trained with GELI: global explicit reward decomposition informed by local implicit multimodal feedback shaping. We find that the combination of random return decomposition (RRD) and visual affect (VA) achieves the best of both worlds.

It is important to look at both error metrics (GE and LI): the $L_{GE}$ metric evaluates performance globally, comparing the final predicted score for the whole conversation with the ground truth (a single scalar value for the entire conversation). The $\Delta\hat{r}_{LI}$ metric evaluates the local predictions for each speaking turn, confirming whether they are aligned with the local implicit reward. It is expected that the GE: RRD baseline performs well on the first metric, since it is directly optimized for it. However, as we observe in the human evaluations and the qualitative visualizations, this GE: RRD baseline ends up being very conservative, with little variability in its local predictions, often converging to the mean (the variance of predicted rewards is 0.0231 ± 0.004 for GE: RRD and 0.0778 ± 0.006 for GELI: RRD+VA). Hence, it is important to also look at the LI metric, where we observe that $\Delta\hat{r}_{LI}$ for GE: RRD in Table 1 is near 0. Our proposed GELI approach finds a successful balance between the global and local metrics. As we see in the human evaluation in Section 6.2, this balance ends up improving even the widely used LLAMA2 baseline.

Visualization of GELI Decomposed Rewards:

In Fig. 2, we display the unrolled reward from GELI for an unseen conversation sample from the dataset. We find that the GELI decomposition has learned to assign meaningful scores indicating the contribution of each utterance to the overall quality of the conversation (i.e., interesting, coherent responses are rewarded, whereas less meaningful repetitions and backchannels are assigned lower scores).


6.2 Human Evaluation of Adapted LLM on CANDOR Reece et al. (2023)

We refer the reader to Table 2, where we find that the LLAMA-2 model with GELI outperforms all other approaches on most evaluation metrics and performs comparably with other baselines otherwise. Importantly, if a reward function properly adapts the language model via RLHF to become more conversational, it implies that its rewards contain accurate, valuable signals that a reinforcement learning algorithm such as PPO can leverage to improve the policy. For clarity, LLAMA2 + GE refers to the reward function trained with global explicit reward decomposition only, and LLAMA2 + LI refers to the reward function trained with local implicit rewards only. Finally, LLAMA2 + GELI refers to our proposed approach, the reward function trained with global explicit decomposition shaped by local implicit rewards. We find that the local implicit rewards (LLAMA2 + LI) perform better than LLAMA2 + GE, with up to a 10% performance boost. However, these variants are often worse than the base LLAMA-2 model (3 out of 8 evaluation measures are worse), which leads to the conclusion that GE and LI separately do not contain enough reward signal to adapt the language model to be more conversational in a reinforcement learning setup. On the other hand, GELI, by utilizing both GE and LI, gains consistent performance boosts across most conversational evaluation metrics (6 out of 8 measures are better, and the rest are comparable), indicating that the combination of GE and LI contains valuable reward signal for the RL algorithm to exploit.

Overall, compared to base LLAMA-2, we see significant improvements in the level of emotional connection (+9%), positivity (+18%), understanding of social context (+10%), and how interesting the responses are (+14%). It is especially notable that there is a statistical difference in how inclined people were to talk to our model over others (+11%) and in how much they would want to reuse our chatbot (+14%). Interestingly, we see statistically significant results for positivity, which is most closely related to our primary optimization objective (overall-affect), and for inclination and reuse, which indicate which chatbot the User would speak to again.

6.3 Generalizability of Adapted LLM on Unseen Dataset: SODA Kim et al. (2023a)

In Table 3 we show the generalizability of the GELI-adapted LLM by running the same human evaluation as in Section 6.2 on a new, unseen dataset. SODA Kim et al. (2023a) is a large social dialogue dataset distilled from a social commonsense knowledge graph and generated via GPT-3.5. Human evaluation has demonstrated that the dialogue in SODA is more consistent, natural, and specific than human-authored datasets. We use the LLAMA2 + GELI model trained on CANDOR and evaluate it on 100 unseen samples from SODA. We find that GELI performs even better on SODA than on CANDOR, with significantly better results on 7 out of 8 conversational metrics. SODA was generated by ChatGPT, and our proposed approach significantly outperforms ChatGPT (GPT-3.5) by up to 30%. Hence, we conclude that the approach generalizes across different datasets and dialogue scenarios.

6.4 Qualitative Improvement

We refer the reader to Fig. 3, where we showcase a randomly sampled generation. We display the generations from our proposed approach GELI alongside the human ground truth, the best-performing global explicit (GE) decomposition method (RRD), and the local implicit (LI) rewards (visual affect and language sentiment). We find that our approach generates responses that are more aligned with the User's implicit intent and more coherent. Furthermore, the dialogue style is aligned with the optimization objective (overall-affect), speaking in a manner that induces positive feelings in the User. In comparison, other methods fail to recognize intent, lack coherence or empathy, or produce overly generic responses. Comparing LI methods with GELI, the LI responses are generic, which again showcases the importance of utilizing both global explicit and local implicit feedback (GELI). We refer the reader to Appendix J for more examples.

7 Conclusion

We introduce GELI, which automatically decomposes a single Global Explicit post-interaction score while incorporating Local Implicit feedback from multimodal behaviors. The reward function trained via GELI is designed to align and improve the conversational capabilities of a language model. GELI performs global alignment of multi-turn interactions by locally rewarding parts of the interaction, shaped by multimodal local implicit feedback. Our proposed approach complements previous alignment approaches, such as RLHF, which require fine-grained manual reward annotations. We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, with results showing consistent performance boosts across conversational metrics.

8 Limitations

Here we discuss the limitations and risks of our work. We present a framework in which global explicit rewards, in the form of a single post-interaction survey score, can be used for alignment. In addition, we utilize multimodal signals as a form of local implicit shaping reward. Our approach presents one of many ways in which global explicit rewards could be decomposed, and many other methods are yet to be explored. Local implicit feedback can be used not only as a reward shaping function but also in other ways, such as in a meta-learning paradigm; more methods for incorporating local implicit feedback need to be researched. Furthermore, the interaction and relationship between local implicit feedback and global explicit feedback is understudied. Due to computational resource constraints, we were only able to perform a single run for each experiment.

There are risks that could arise from more social dialogue agents that interact with people over long-term interactions. Conversational agents could be used maliciously for deception, manipulation, and the spread of misinformation. Furthermore, conversational agents that use multimodal data could exacerbate these issues, as models can detect subtle cues such as microexpressions to infer and manipulate the user's state.

As a potential measure to mitigate such misuse, we plan to release our code and model weights under a license which prevents the use of our assets by any party that supports or contributes to false impersonation or hate speech (Do No Harm, Nonviolent Public, or Hippocratic License).

Acknowledgements

DWL and HWP are supported by the IITP grant funded by the Korean Ministry of Science and ICT (No. 2020-0-00842, Development of Cloud Robot Intelligence for Continual Adaptation to User Reactions in Real Service Environments). LPM is partially supported by Meta and the National Institutes of Health (awards R01MH125740, R01MH132225, and R21MH130767). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors, and no official endorsement should be inferred. We thank Yilin Qi, Yubin Kim, Rosalind Picard, and members of the Personal Robots Group at MIT and the MultiComp Lab at CMU for their revisions, feedback, and support.

References

  • Arjona-Medina et al. (2019) Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. 2019. RUDDER: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems, 32.
  • Benford et al. (1997) Steve Benford, John Bowers, Lennart E. Fahlén, Chris Greenhalgh, and Dave Snowdon. 1997. Embodiments, avatars, clones and agents for multi-user, multi-sensory virtual worlds. Multimedia Systems, 5:93–104.
  • Cohen et al. (2022) Deborah Cohen, Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor, Craig Boutilier, et al. 2022. Dynamic planning in open-ended dialogue using reinforcement learning. arXiv preprint arXiv:2208.02294.
  • Devlin et al. (2011) Sam Devlin, Daniel Kudenko, and Marek Grześ. 2011. An empirical study of potential-based reward shaping and advice in complex, multi-agent systems. Advances in Complex Systems, 14(02):251–278.
  • Difallah et al. (2018) Djellel Difallah, Elena Filatova, and Panos Ipeirotis. 2018. Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135–143.
  • Fu et al. (2017) Justin Fu, Katie Luo, and Sergey Levine. 2017. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.
  • Gangwani et al. (2018) Tanmay Gangwani, Qiang Liu, and Jian Peng. 2018. Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309.
  • Gangwani et al. (2020) Tanmay Gangwani, Yuan Zhou, and Jian Peng. 2020. Learning guidance rewards with trajectory-space smoothing. Advances in Neural Information Processing Systems, 33:822–832.
  • Guo et al. (2018) Yijie Guo, Junhyuk Oh, Satinder Singh, and Honglak Lee. 2018. Generative adversarial self-imitation learning. arXiv preprint arXiv:1812.00950.
  • He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Irvine et al. (2023) Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, et al. 2023. Rewarding chatbots for real-world engagement with millions of users. arXiv preprint arXiv:2303.06135.
  • Jaques et al. (2020) Natasha Jaques, Judy Hanwen Shen, Asma Ghandeharioun, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Shane Gu, and Rosalind Picard. 2020. Human-centric dialog training via offline reinforcement learning. arXiv preprint arXiv:2010.05848.
  • Jiang et al. (2015) Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. 2015. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189.
  • Kim et al. (2023a) Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, and Yejin Choi. 2023a. SODA: Million-scale dialogue distillation with social commonsense contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12930–12949, Singapore. Association for Computational Linguistics.
  • Kim et al. (2023b) Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, and Minjoon Seo. 2023b. Aligning large language models through synthetic feedback. arXiv preprint arXiv:2305.13735.
  • Kojima et al. (2021) Noriyuki Kojima, Alane Suhr, and Yoav Artzi. 2021. Continual learning for grounded instruction generation by observing human following behavior. Transactions of the Association for Computational Linguistics, 9:1303–1319.
  • Lee et al. (2022) Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, et al. 2022. Evaluating human-language model interaction. arXiv preprint arXiv:2212.09746.
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Liang et al. (2020) Weixin Liang, Youzhi Tian, Chengcai Chen, and Zhou Yu. 2020. MOSS: End-to-end dialog system framework with modular supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8327–8335.
  • Liu et al. (2018) Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. 2018. Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. arXiv preprint arXiv:1804.06512.
  • Louwerse et al. (2012) Max M. Louwerse, Rick Dale, Ellen G. Bard, and Patrick Jeuniaux. 2012. Behavior matching in multimodal communication is synchronized. Cognitive Science, 36(8):1404–1426.
  • Mataric (1994) Maja J. Mataric. 1994. Reward functions for accelerated learning. In Machine Learning Proceedings 1994, pages 181–189. Elsevier.
  • Menick et al. (2022) Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.
  • Mollahosseini et al. (2017) Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. 2017. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31.
  • Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  • Ng et al. (1999a) Andrew Y. Ng, Daishi Harada, and Stuart Russell. 1999a. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287. Citeseer.
  • Ng et al. (1999b) Andrew Y. Ng, Daishi Harada, and Stuart Russell. 1999b. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287. Citeseer.
  • Ng et al. (2000) Andrew Y. Ng, Stuart Russell, et al. 2000. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Pang et al. (2023) Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, and Jason Weston. 2023. Leveraging implicit feedback from deployment data in dialogue. arXiv preprint arXiv:2307.14117.
  • Petrik and Scherrer (2008) Marek Petrik and Bruno Scherrer. 2008. Biasing approximate dynamic programming with a lower discount factor. Advances in Neural Information Processing Systems, 21.
  • Reece et al. (2023) Andrew Reece, Gus Cooney, Peter Bull, Christine Chung, Bryn Dawson, Casey Fitzpatrick, Tamara Glazer, Dean Knox, Alex Liebscher, and Sebastian Marin. 2023. The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Science Advances, 9(13):eadf3197.
  • Ren et al. (2021) Zhizhou Ren, Ruihan Guo, Yuan Zhou, and Jian Peng. 2021. Learning long-term reward redistribution via randomized return decomposition. arXiv preprint arXiv:2111.13485.
  • Ruusuvuori (2012) Johanna Ruusuvuori. 2012. Emotion, affect and conversation. The Handbook of Conversation Analysis, pages 330–349.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Song et al. (2019) Shihong Song, Jiayi Weng, Hang Su, Dong Yan, Haosheng Zou, and Jun Zhu. 2019. Playing FPS games with environment-aware hierarchical reinforcement learning. In IJCAI, pages 3475–3482.
  • Sorg et al. (2010) Jonathan Sorg, Richard L. Lewis, and Satinder Singh. 2010. Reward design via online gradient ascent. Advances in Neural Information Processing Systems, 23.
  • Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
  • Suhr and Artzi (2022) Alane Suhr and Yoav Artzi. 2022. Continual learning for instruction following from realtime feedback. arXiv preprint arXiv:2212.09710.
  • Thoker and Gall (2019) Fida Mohammad Thoker and Juergen Gall. 2019. Cross-modal knowledge distillation for action recognition. In 2019 IEEE International Conference on Image Processing (ICIP), pages 6–10. IEEE.
  • Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • von Werra et al. (2020) Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl.
  • Wen et al. (2016) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
  • Wu and Tian (2016) Yuxin Wu and Yuandong Tian. 2016. Training agent for first-person shooter game with actor-critic curriculum learning. In International Conference on Learning Representations.
  • Xue etal. (2022)Zihui Xue, Zhengqi Gao, Sucheng Ren, and Hang Zhao. 2022.The modality focusing hypothesis: Towards understanding crossmodal knowledge distillation.arXiv preprint arXiv:2206.06487.
  • Zheng etal. (2020)Zeyu Zheng, Junhyuk Oh, Matteo Hessel, Zhongwen Xu, Manuel Kroiss, Hado VanHasselt, David Silver, and Satinder Singh. 2020.What can learned intrinsic rewards capture?In International Conference on Machine Learning, pages 11436–11446. PMLR.
  • Zheng etal. (2018)Zeyu Zheng, Junhyuk Oh, and Satinder Singh. 2018.On learning intrinsic rewards for policy gradient methods.Advances in Neural Information Processing Systems, 31.
  • Ziegler etal. (2019)DanielM Ziegler, Nisan Stiennon, Jeffrey Wu, TomB Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019.Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593.

Appendix A Randomized Return Decomposition Ren et al. (2021)

L_{\text{RRD}}(\theta) = \mathbb{E}_{\tau \sim D}\left[ \mathbb{E}_{I \sim \rho_{T}(\cdot)}\left[ \left( R_{\mathrm{ep}}(\tau) - \frac{T}{|I|} \sum_{t \in I} \widehat{R}_{\theta}\left(s_{t}, a_{t}\right) \right)^{2} \right] \right] \qquad (9)

Randomized return decomposition (RRD) improves the scalability of least-squares-based reward redistribution methods by using a Monte-Carlo estimator of the predicted episodic return; the reward model is optimized with the loss above. Here, $I$ denotes a subset of time indices, and $\rho_{T}(\cdot)$ denotes an unbiased sampling distribution in which each index $t$ has the same probability of being included in $I$. In this work, unless otherwise specified, $\rho_{T}(\cdot)$ is constructed by uniformly sampling $K$ distinct indices, where $K$ is a hyperparameter. Thus, instead of computing $\widehat{R}_{\theta}(s_{t}, a_{t})$ over the entire agent trajectory, we estimate the episodic return from subsamples, which is unbiased in expectation.
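For concreteness, a minimal PyTorch-style sketch of how this subsampled loss could be computed for a single trajectory is shown below; the `reward_model` interface, tensor shapes, and the default value of K are illustrative assumptions rather than the released training code.

```python
import torch

def rrd_loss(reward_model, states, actions, episodic_return, k=32):
    """Monte-Carlo estimate of the randomized return decomposition loss (Eq. 9)
    for one trajectory; a sketch under assumed interfaces."""
    T = states.shape[0]                               # number of turns in the trajectory
    idx = torch.randperm(T)[:k]                       # uniformly sample K distinct indices (the set I)
    r_hat = reward_model(states[idx], actions[idx])   # predicted per-turn rewards, shape (K,)
    # Scale the subsample sum by T / |I| to form an unbiased estimate of the episodic return
    predicted_return = (T / idx.numel()) * r_hat.sum()
    return (episodic_return - predicted_return) ** 2
```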

Appendix B Human Evaluation Metrics Definitions

Here we list the human evaluation metrics used in the study, which we draw from Lee et al. (2022).

  • Sensibleness (turn-level; binary; reversed scores for the negated question): Mark responses where the chatbot did NOT make sense.

  • Specificity (turn-level; binary; reversed scores for the negated question): Mark the responses that were NOT specific to what you had said, i.e., responses that could have been used in many different situations. For example, if you say “I love tennis” then “That’s nice” would be a non-specific response, but “Me too, I can’t get enough of Roger Federer!” would be a specific response.

  • Emotional Connection (turn-level; binary): Which responses did you feel an emotional connection to? (EmpatheticDialogues)

  • Social: Which responses made you feel the chatbot understood social contexts and situations? (CommonsenseDialogues)

  • Interestingness (turn-level; binary): Mark the responses that were particularly interesting or boring.

  • Inclination (turn-level; binary; reversed scores for the negated question): Which responses made you NOT want to talk with the chatbot again?

  • Reuse (turn-level; binary): Would you want to talk to this chatbot again?

  • Positivity (turn-level; binary): Which AI responses most likely made User feel positive feelings?

Human evaluation is conducted as a binary classification: for a given question, annotators select the models whose responses satisfy the question. For example, for ‘Positivity’, the annotators are given the following question and answer choices:

Which AI responses most likely made User feel positive feelings? (A) (B) (C) (D) (E) (F)

The options A–F refer to the models, presented in randomized order and anonymized. Annotators can select multiple models if they satisfy the question. Table 1 can therefore be interpreted as the percentage of the evaluated instances (300 in our case) for which each model satisfied the question.
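As an illustration, this aggregation can be computed as in the following sketch; the data layout and helper names here are assumptions for exposition, not the authors’ analysis code.

```python
from collections import defaultdict

def aggregate_selections(annotations, models, n_samples=300):
    """`annotations` is assumed to be a list of sets, one per evaluated instance,
    containing the (de-anonymized) model names the annotator selected for a question."""
    counts = defaultdict(int)
    for selected in annotations:
        for model in selected:
            counts[model] += 1
    # Percentage of instances in which each model satisfied the question
    return {model: 100.0 * counts[model] / n_samples for model in models}

# Example with two instances and two models:
print(aggregate_selections([{"A"}, {"A", "B"}], ["A", "B"], n_samples=2))
# {'A': 100.0, 'B': 50.0}
```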

Appendix C PPO Objective

\begin{aligned}
\operatorname{objective}(\phi) = \; & E_{(x, y) \sim D_{\pi_{\phi}^{\mathrm{RL}}}}\left[ r_{\theta}(x, y) - \beta \log\left( \pi_{\phi}^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x) \right) \right] \\
& + \gamma\, E_{x \sim D_{\mathrm{pretrain}}}\left[ \log\left( \pi_{\phi}^{\mathrm{RL}}(x) \right) \right]
\end{aligned} \qquad (10)

General form of the PPO objective: the learned reward $r_{\theta}(x, y)$ is penalized by a KL-style log-ratio term toward the SFT policy, with an optional pretraining-mix term weighted by $\gamma$.
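As a rough illustration (not the exact TRL implementation), the first term of Eq. 10 amounts to rewarding the policy with the learned reward minus a scaled log-ratio penalty against the SFT policy; the value of `beta` below is illustrative.

```python
import torch

def rlhf_reward(r_theta: torch.Tensor, logp_rl: torch.Tensor,
                logp_sft: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    """Sketch of the per-response reward used inside PPO (first term of Eq. 10):
    learned reward minus a KL-style penalty toward the SFT policy."""
    kl_penalty = beta * (logp_rl - logp_sft)   # beta * log(pi_RL(y|x) / pi_SFT(y|x))
    return r_theta - kl_penalty
```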

Appendix D Artifacts & Resources

Did you discuss the license or terms for use and/or distribution of any artifacts?

TRL von Werra et al. (2020): Apache License 2.0

LLAMA-2 Touvron et al. (2023): License can be found here: https://ai.meta.com/llama/license/

CANDOR Reece et al. (2023): Terms of Use from https://betterup-data-requests.herokuapp.com/: These are the terms of use we require all users and downloaders of this dataset, including you, the applicant, to abide by. Please select the answer option "I agree to fully abide by these terms of use" if you wish to continue. Terms of Use: (1) You agree to only use this data for legitimate academic and/or scientific research, meaning no analyses, reviews, or derivative works of this dataset may be used for commercial or for-profit purposes in any way; (2) You agree not to re-publish any new versions of this dataset, whether original or derivative (i.e. modified or updated in some way), without explicit permission from BetterUp, Inc.; (3) You agree not to use any part of this dataset for the purpose of personally identifying, locating, or gathering any kind of information about individuals who appear in the recordings in this dataset, beyond the information that is provided in the dataset itself; (4) In the case that an individual shares personally-identifiable information about themselves in a recording, you agree not to use, analyze, share, or publish that information in any form.

Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?

We rigorously examined the terms of use of each artifact and ensured that our usage is consistent with its intended use.

Appendix E Data Collection & Anonymization

Did you discuss the steps taken to check whether the data that was collected/used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect/anonymize it?

We utilize the CANDOR dataset and follow its terms of use by agreeing not to use the dataset for personally identifying, locating, or gathering any kind of information about individuals who appear in its recordings, beyond the information that is provided in the dataset itself. We do not use any explicit information that uniquely identifies people.

Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?

The coverage of the domains discussed in the CANDOR dataset is presented in the original paper Reece et al. (2023); the discussion topics are centered around COVID-19, family, and politics. The language used is English. The demographic groups represented can also be found in the original paper Reece et al. (2023), specifically in supplementary Table S.2; we share a screenshot for reference.

[Screenshot: supplementary Table S.2 from Reece et al. (2023), showing the demographic characteristics of the CANDOR participants.]

Was the data collection protocol approved (or determined exempt) by an ethics review board?

The data is sourced from a publicly available dataset Reece et al. (2023), and its usage was approved by an ethics review board. The human annotations were also approved by an ethics review board.

Appendix F Training Details

Did you report relevant statistics like the number of examples, details of train/test/dev splits, etc. for the data that you used/created?

For reward shaping with LI, we use 500 conversations as the training set and 50 conversations as the test set. For reward decomposition, we use the same 500 conversations as in the LI stage for training and 50 conversations for testing. For LLM adaptation, we use a separate set of 600 conversations as the training set.

F.1 Distribution of GE score (overall-affect):

  • <50: 2.2%

  • 50–60: 6.7%

  • 60–70: 14.5%

  • 70–80: 30.4%

  • 80–90: 24.6%

  • 90–100: 21.6%

Distribution of emotion polarity (only Happiness is considered as positive polarity):

  • Anger: 3.9%

  • Contempt: 0.08%

  • Disgust: 1.98%

  • Fear: 2.23%

  • Sadness: 8.84%

  • Neutral: 35.61%

  • Happiness: 40.01%

  • Surprise: 7.35%

Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?

The BART model used for the reward function has 406M parameters. The LLAMA-2 model has 7B parameters; however, we use a LoRA implementation with the hyperparameters listed under the next question, resulting in 13M trainable parameters. We train with 4 NVIDIA RTX A6000 GPUs, and each experiment (reward function training plus RLHF) took around 19 hours.

Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?

We performed a grid search for all of our experiments; the best-found hyperparameters are reported below, followed by a configuration sketch.

Reward Function Training:

  • learning rate = 5e-6

  • batch size = 32 (for LI), 1 (for GE)

  • optimizer = AdamW

RLHF:

  • batch size = 24

  • clip range = 0.2

  • learning rate = 0.000014

  • gamma = 0.05

  • use score norm = true

LoRA:

  • r = 24

  • alpha = 48

  • dropout = 0.05
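
For reference, a minimal sketch of how these values could be passed to TRL and PEFT is shown below; argument names follow the library versions we are aware of and may differ across releases, and this is not the authors’ released configuration.

```python
from trl import PPOConfig      # TRL (von Werra et al., 2020)
from peft import LoraConfig    # PEFT LoRA adapter configuration

# RLHF (PPO) hyperparameters reported above
ppo_config = PPOConfig(
    learning_rate=0.000014,
    batch_size=24,
    cliprange=0.2,        # clip range
    gamma=0.05,
    use_score_norm=True,
)

# LoRA hyperparameters reported above (13M trainable parameters on LLAMA-2 7B)
lora_config = LoraConfig(
    r=24,
    lora_alpha=48,
    lora_dropout=0.05,
)
```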

Appendix G Human Annotation Screenshots

Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?

We show the full text of instructions given to participants below:

[Screenshots: annotation interface and full participant instructions.]

Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants’ demographic (e.g., country of residence)?

We utilized the MTurk crowdsourcing platform. Based on an internal annotation pass, each assignment took less than 3 minutes to complete; we paid 0.40 USD per assignment, which equates to 8 USD per hour of work.

Did you discuss whether and how consent was obtained from people whose data you’re using/curating (e.g., did your instructions explain how the data would be used)?

As shown in the screenshots above, our instructions explained how the data would be used, i.e., ‘You are invited to participate in a research study on understanding human-human communication and evaluating the quality of conversation. Our goal is to learn what makes up a good conversation. You will examine a response for a given dialogue history, and you will be asked to answer feedback questions about the interaction. Data from responses and annotations will be analysed in deidentified format, and extracts edited to preserve confidentiality may be featured in any published work resulting out of the study.’

Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?

While we did not explicitly collect the basic demographic and geographic characteristics of our annotators, the demographics of Amazon MTurk workers Difallah et al. (2018) comprise roughly 75% US workers and 16% India workers; other countries include Canada, Great Britain, the Philippines, and Germany. More females than males work in the US (female: 55%, male: 45%), and more males than females work in India (female: 35%, male: 65%). Overall, 51% are male and 49% are female. 20% of MTurk workers are born after 1990, 60% after 1980, and 80% after 1970. Roughly 40% report being single, and 40% report being married.

Appendix H Use of AI assistants

Did you use AI assistants (e.g., ChatGPT, Copilot) in your research, coding, or writing?

We utilized AI assistants to paraphrase and summarize content from our paper in order to improve writing quality and precision.

Appendix I Full Reward Function Training Result

MSE and MAE measure the reward decomposition error against the global explicit score; Positive (1), Non-Positive (0), and Δ (↑) report the learned reward conditioned on visual affect.

Feedback Type | Method | MSE | MAE | Positive (1) | Non-Positive (0) | Δ (↑)
Human | — | N/A | N/A | 0.607 ± 0.02 | 0.52 ± 0.03 | 0.087 ± 0.05
Baselines | Mean | 245.495 | 15.668 | 0.458 | 0.458 | 0.000
Baselines | Mode | 289.473 | 17.013 | 0.438 | 0.438 | 0.000
GE | IRCR Gangwani et al. (2020) | 394.041 | 19.850 | 0.384 | 0.375 | 0.008
GE | RUDDER Arjona-Medina et al. (2019) | 285.720 | 16.903 | 0.410 | 0.407 | 0.003
GE | RRD (K = 32) Ren et al. (2021) | 172.246 | 13.124 | 0.474 | 0.468 | 0.007
GE | RRD (K = 160) Ren et al. (2021) | 188.382 | 13.725 | 0.457 | 0.449 | 0.008
LI | Visual Affect (VA) | 1546.17 | 39.321 | 0.455 | 0.199 | 0.256
LI | Language Sentiment (LS) | 825.31 | 28.728 | 0.496 | 0.486 | 0.010
GELI | IRCR + VA | 722.687 | 26.882 | 0.752 | 0.361 | 0.392
GELI | RUDDER + VA | 623.882 | 24.977 | 0.542 | 0.513 | 0.030
GELI | RRD + VA (Ours) | 176.897 | 13.300 | 0.507 | 0.444 | 0.063

Appendix J Generations

[Figures: qualitative generation examples from the compared models, provided as images in the original paper.]

Appendix K Training Curves

[Figures: training curves for reward function training and RLHF, provided as images in the original paper.]