Alignment Research Center

The mission of the Alignment Research Center (ARC) is to ensure that powerful machine learning systems of the future are designed and developed safely and for the benefit of humanity. It was founded in April 2021 by Paul Christiano and other researchers focused on the theoretical challenges of AI alignment. ARC attempts to develop scalable methods for training AI systems to behave honestly and helpfully. A key part of its methodology is considering how proposed alignment techniques might break down or be circumvented as systems become more advanced. ARC has been expanding from theoretical work into empirical research, industry collaborations, and policy.

In March 2022, ARC received $265,000 from Open Philanthropy. After the bankruptcy of FTX, ARC said it would return a $1.25 million grant from disgraced cryptocurrency financier Sam Bankman-Fried's FTX Foundation, stating that the money "morally (if not legally) belongs to FTX customers or creditors."

In 2022, Beth Barnes joined ARC from OpenAI to start ARC Evals, a team working on "evaluating the capabilities and alignment of advanced AI models". In December 2023, ARC Evals was spun out as METR, an independent nonprofit.

In March 2023, OpenAI asked ARC to test GPT-4 to assess the model's ability to exhibit power-seeking behavior. ARC evaluated GPT-4's ability to strategize, reproduce itself, gather resources, stay concealed within a server, and execute phishing operations. As part of the test, GPT-4 was asked to solve a CAPTCHA puzzle. It was able to do so by hiring a human worker on TaskRabbit, a gig work platform. When the worker asked whether it was a robot, GPT-4 deceived them into believing it was a vision-impaired human. ARC determined that GPT-4 responded impermissibly to prompts eliciting restricted information 82% less often than GPT-3.5, and hallucinated 60% less than GPT-3.5.

== See also ==