Norway’s National Library is building a sovereign Norwegian LLM from its 20 PB cultural archive, backed by a copyright agreement with Norwegian newspapers. It uses 2 PB of Huawei flash storage to clean and prepare data before training on the Sigma2 Olivia supercomputer, and it names data quality and pipeline throughput as the main bottlenecks.
- Norway’s National Library is building a sovereign large language model that understands Norwegian, because no commercial provider is creating a local-language LLM.
- The Ministry of Culture assigned the library this role because it holds Norway’s largest digital collection of books, newspapers, web pages, and other cultural heritage materials.
- A copyright agreement with Norwegian newspapers allows LLM training on copyrighted content, a right the library says private companies do not have.
- The library has digitized its collection since 2005 and stores 20 PB of unique data in a 3-2-1 preservation setup, totaling about 60 PB overall.
- Its AI training pipeline uses 2 PB of Huawei OceanStor Dorado flash storage for low-latency data ingestion, cleaning, deduplication, normalization, validation, and preparation.
- After preprocessing, the data is sent to Norway’s national supercomputer, Sigma2 Olivia, for training runs.
- Husnes said the main challenges are data quality and pipeline throughput, plus unresolved issues around evaluation, governance, and orchestration across preservation, on-prem AI, and supercomputing systems.
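The preprocessing stages listed above (normalization, validation, deduplication) can be illustrated with a small Python sketch. It is not the library's actual pipeline, which the article does not describe in detail. The function names, the 20-character minimum length, and the use of exact-match SHA-256 hashing for deduplication are all assumptions made for illustration.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Hypothetical normalization: Unicode NFC plus whitespace collapsing.

    NFC composes sequences such as 'a' + combining ring into the single
    code point 'å', so visually identical strings compare equal.
    """
    return " ".join(unicodedata.normalize("NFC", text).split())


def is_valid(text: str, min_chars: int = 20) -> bool:
    """Hypothetical validation rule: drop very short fragments.

    The 20-character threshold is an assumption, not a figure from the article.
    """
    return len(text) >= min_chars


def prepare(docs: Iterable[str]) -> Iterator[str]:
    """Yield normalized, valid documents, skipping exact duplicates.

    Duplicates are detected by hashing the case-folded text, which only
    catches exact repeats; near-duplicate detection would need another method.
    """
    seen: set[str] = set()
    for raw in docs:
        text = normalize(raw)
        if not is_valid(text):
            continue
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


if __name__ == "__main__":
    sample = [
        "Nasjonalbiblioteket  digitaliserer\n aviser.",
        "nasjonalbiblioteket digitaliserer aviser.",  # duplicate after case folding
        "kort",  # too short to pass validation
    ]
    for doc in prepare(sample):
        print(doc)
```

Hashing keeps memory use proportional to the number of unique documents instead of their total size, which matters when throughput is a stated bottleneck. A production pipeline at this scale would likely shard the work and persist the hash set.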