Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by:
- Iterating on explanations. We can improve scores by asking GPT-4 to come up with possible counterexamples, then revising explanations in light of their activations.
- Using larger models to give explanations. The average score goes up as the explainer model's capabilities increase. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
- Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
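The iteration step above can be sketched as a simple loop that keeps a revised explanation only when it scores better. The callables here (`get_counterexamples`, `revise`, `score`) are illustrative placeholders standing in for calls to an explainer model such as GPT-4 and for the simulation-based scorer; they are not the released API.

```python
from typing import Callable, List

def iterate_explanation(
    explanation: str,
    get_counterexamples: Callable[[str], List[str]],  # e.g. ask GPT-4 for texts the explanation mispredicts
    revise: Callable[[str, List[str]], str],          # e.g. ask GPT-4 to revise given those activations
    score: Callable[[str], float],                    # simulation-based explanation score
    max_rounds: int = 3,
) -> str:
    """Iteratively revise an explanation, keeping a revision only if it improves the score."""
    best, best_score = explanation, score(explanation)
    for _ in range(max_rounds):
        counterexamples = get_counterexamples(best)
        candidate = revise(best, counterexamples)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

In practice each callable would wrap a model call; the greedy accept-if-better rule is one simple choice among several.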
We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
We found over 1,000 neurons with explanations that scored at least 0.8, meaning that, according to GPT-4, they account for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 did not understand. We hope that as explanations improve, we will be able to quickly uncover interesting qualitative understanding of model computations.
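One way to read a score like 0.8: an explanation is scored by having a model simulate the neuron's activations from the explanation alone, then comparing the simulated activations to the real ones. A minimal sketch, assuming a plain Pearson-correlation comparison (the released scorer is more involved than this):

```python
import math

def explanation_score(real: list[float], simulated: list[float]) -> float:
    """Pearson correlation between real and simulated neuron activations.

    1.0 means the simulated activations track the real ones perfectly on
    these tokens; 0.0 means no linear relationship.
    """
    n = len(real)
    mean_r = sum(real) / n
    mean_s = sum(simulated) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_r == 0 or var_s == 0:
        return 0.0  # a constant series carries no signal to correlate
    return cov / math.sqrt(var_r * var_s)
```

Under this reading, 0.8 means the simulation explains a large share of the variation in the neuron's top activations, which is why such neurons count as "well-explained".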