Norway’s 2 petabytes of Huawei flash storage and LLM training

Norway’s National Library is building a sovereign Norwegian LLM from its 20 PB cultural archive, supported by a copyright deal with Norwegian newspapers. It uses 2 PB of Huawei flash storage to clean and prepare data before training on the Sigma2 Olivia supercomputer, and it names data quality and pipeline throughput as the main bottlenecks.
  • Norway’s National Library is building a sovereign large language model that understands Norwegian, because no commercial provider is creating a local-language LLM.
  • The Ministry of Culture assigned the library this role because it holds Norway’s largest digital collection of books, newspapers, web pages, and other cultural heritage materials.
  • A copyright agreement with Norwegian newspapers allows the library to train the LLM on copyrighted content, a right the library says private companies do not have.
  • The library has digitized its collection since 2005 and stores 20 PB of unique data in a 3-2-1 preservation setup (three copies, on two types of media, with one copy off-site), so three copies of 20 PB come to about 60 PB overall.
  • Its AI training pipeline uses 2 PB of Huawei OceanStor Dorado flash storage for low-latency data ingestion, cleaning, deduplication, normalization, validation, and preparation.
  • After preprocessing, the data is sent to Norway’s national supercomputer, Sigma2 Olivia, for training runs.
  • Husnes said the main challenges are data quality and pipeline throughput, plus unresolved issues around evaluation, governance, and orchestration across preservation, on-prem AI, and supercomputing systems.
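
The preprocessing stages named above (cleaning, normalization, validation, deduplication) can be illustrated with a minimal Python sketch. This is not the library's actual pipeline, which the source does not describe in detail. The function names `normalize`, `is_valid`, and `prepare`, the 20-character minimum length, and the use of exact SHA-256 hashing for deduplication are all assumptions made for illustration.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Hypothetical cleaning step: NFC-normalize and collapse whitespace."""
    # NFC keeps Norwegian letters such as å in a single composed code point.
    text = unicodedata.normalize("NFC", text)
    # split() with no arguments drops all runs of whitespace, including newlines.
    return " ".join(text.split())


def is_valid(text: str, min_chars: int = 20) -> bool:
    """Hypothetical validation step: reject very short fragments.

    The 20-character threshold is an assumed value, not one from the source.
    """
    return len(text) >= min_chars


def prepare(docs: Iterable[str]) -> Iterator[str]:
    """Yield normalized, valid documents, skipping exact duplicates."""
    seen = set()  # SHA-256 digests of documents already emitted
    for raw in docs:
        text = normalize(raw)
        if not is_valid(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


if __name__ == "__main__":
    sample = [
        "Dette er  en norsk   avistekst fra arkivet.",
        "Dette er en norsk avistekst fra arkivet.",  # duplicate after cleaning
        "kort",  # too short to pass validation
    ]
    for doc in prepare(sample):
        print(doc)
```

Hashing normalized text catches only exact duplicates. An archive of scanned newspapers would likely also need near-duplicate detection, which this sketch leaves out.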