
OpenLogic: Open Source Big Data Infrastructure – Key Technologies for Data Storage, Mining, and Visualization

Big data infrastructure refers to the systems (hardware, software, and network components) and processes that support the collection, management, and analysis of massive volumes of data. Companies that handle large amounts of data flowing in continuously from multiple sources typically rely on open source big data frameworks (such as Hadoop and Spark), databases (such as Cassandra), and stream processing platforms (such as Kafka) as the cornerstones of their big data infrastructure.

This article looks at the technologies and approaches commonly used for data storage, processing, mining, and visualization in an open source big data stack.

Data Storage and Processing

The primary goal of big data storage is to hold massive volumes of data reliably so that it can be analyzed and used later. Enterprises need a scalable architecture that can collect, manage, and analyze huge data sets in real time. Big data storage solutions are designed to handle the velocity, volume, and complexity of large data sets. These solutions include data lakes, data warehouses, and data pipelines, which can be deployed in the cloud, on premises, or at an off-site physical location (known as colocation storage).

  • Data Lakes
    A data lake is a centralized storage solution that processes and secures data in its raw format, with no limits on size. Data lakes support a wide range of intelligent analytics applications, such as machine learning and data visualization.
  • Data Warehouses
    A data warehouse aggregates data from different sources into a single store for powerful analytics, data mining, artificial intelligence, and other applications. Unlike a data lake, a data warehouse uses a three-tier architecture to store data.
  • Data Pipelines
    A data pipeline collects raw data from one or more sources, optionally merges and transforms it, and then delivers it to another destination such as a data lake or data warehouse.

Related Technologies

Wherever the data is stored, the core of any big data stack is a processing framework. Apache Hadoop is a well-known open source example that enables distributed processing of large data sets across clusters of computers. Although Hadoop has been around for a long time, it remains popular, especially for non-cloud solutions. Hadoop integrates seamlessly with other open source data technologies such as Hive and HBase to meet enterprise needs more comprehensively.

Data Mining

Data mining is the process of filtering, sorting, and classifying data from large data sets to reveal patterns and relationships, helping enterprises identify and solve complex business problems through data analysis. Machine learning (ML), artificial intelligence (AI), and statistical analysis are key elements of data mining, used to examine, sort, and prepare data for deeper analysis. Advanced machine learning algorithms and AI tools make it easier to mine huge data sets, including customer data, transaction records, and even log files collected from sensors, actuators, IoT devices, mobile applications, and servers.

Every data science application calls for a different data mining approach. Pattern recognition and anomaly detection are two of the most common applications, and they combine multiple techniques to mine data. Below are some basic data mining techniques commonly used across industries.

  • Association Rule
    An association rule is a statement that establishes an association or relationship between two or more data items. Associations are evaluated with support and confidence metrics: support measures how frequently the items appear in the data set, while confidence relates to how accurate the statement is (a small worked example follows this list). For example, when tracking customers' online shopping behavior, you might observe that customers who buy coffee pods usually also buy cookies. In this case, the association rule establishes a relationship between the two items (cookies and coffee pods) and predicts a future purchase whenever a customer adds coffee pods to their cart.
  • Classification
    Classification techniques sort the items in a data set into different categories. For example, vehicles can be classified as sedans, hatchbacks, petrol, diesel, or electric based on attributes such as shape, wheel type, or even number of seats. When a new vehicle appears, it can be assigned to a category based on the identified attributes. The same classification strategy can be applied to customers based on factors such as age, address, purchase history, and social group.
  • Clustering
    Clustering techniques group data elements into clusters that share common characteristics. By identifying one or more attributes, pieces of data are grouped into categories. Common clustering techniques include k-means, hierarchical clustering, and Gaussian mixture models.
  • Regression
    Regression is a statistical modeling technique that uses previous observations to predict new data values. In other words, it determines relationships between data elements based on predicted values for a defined set of variables. This kind of classifier is known as a continuous-value classifier.
  • Sequence & Path Analysis
    Sequence mining identifies patterns in which a particular event or data value leads to other events later on. This technique suits long-term data analysis, since sequence analysis is key to spotting trends or events that recur at regular intervals. For example, when a customer buys one item, you can use sequential patterns to recommend or add other items to the cart based on the customer's purchasing pattern.
  • Neural Networks
    Neural networks are algorithms that mimic the human brain and attempt to replicate its activity to accomplish a goal or task. They are used in many pattern recognition applications, which typically involve deep learning techniques. Neural networks are a product of advanced machine learning research.
  • Prediction
    Predictive data mining techniques are typically used to forecast the occurrence of events, such as machine failures, industrial component faults, fraud, or company profits exceeding a given threshold. Combined with other mining methods, prediction helps analyze trends, establish correlations, and perform pattern matching. With this technique, data miners can analyze past events to forecast future ones.
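To make the support and confidence metrics mentioned above concrete, here is a small worked example using the coffee-and-cookies rule; the counts are purely hypothetical and only illustrate how the two ratios are formed:

support(coffee pods → cookies) = transactions containing both items / all transactions = 150 / 1,000 = 15%
confidence(coffee pods → cookies) = transactions containing both items / transactions containing coffee pods = 150 / 300 = 50%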

Related Technologies

For data mining workloads, open source technologies such as Spark, YARN, and Oozie are excellent engines that use flexible, powerful MapReduce and batch processing techniques.

Data Visualization

Data visualization is the practice of presenting information and data graphically. Using visual elements such as charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. As more businesses rely on big data to make critical decisions, visualization has become a key tool for interpreting the huge volumes of data generated every day.

Good data visualization turns data into an easily understood medium that tells the story behind the data effectively. It removes the noise and highlights the useful information, such as trends and outliers. However, data visualization is not simply about making charts prettier or piling information onto a graph. Effective visualization requires a delicate balance between form and function. A plain chart may be overlooked for lacking visual appeal yet still carry a powerful message; likewise, a stunning visualization can fail to convey the right message, or even create confusion. Data and visuals need to work together, combining strong analysis with engaging storytelling to produce visualizations that approach an art form.

Related Technologies

Grafana is an open source package that meets these needs and includes all of the essential visualization building blocks. With tools like Grafana, enterprises can effectively monitor their big data implementations and use the resulting visualizations to make informed decisions, improve system performance, and streamline troubleshooting.

Summary

While this article covers some of the fundamentals of big data infrastructure, the field is vast and a single article can hardly be exhaustive. In addition, implementing and maintaining big data infrastructure demands considerable technical skill. Because of the complexity of the technologies involved, many organizations that lack in-house expertise turn to third-party vendors for commercial support or managed big data platform services. Investing in a big data platform can pay off handsomely, but only when it is backed by a sound big data strategy and managed by professionals with the necessary skills and experience.

About OpenLogic
OpenLogic, by Perforce, provides comprehensive enterprise-grade support and services designed for organizations that use open source software in their infrastructure. We support more than 400 open source technologies, with guaranteed service-level agreements (SLAs) and direct access to experienced enterprise architects. With 24x7 ticket-based support, professional services, and training, OpenLogic delivers a complete, end-to-end open source support solution.

About Version 2 Digital
A professional distributor and leader in cybersecurity solutions
Version 2 (Taiwan) is one of the most dynamic IT companies in Asia. With years of experience in the IT field, it is committed to delivering up-to-date security solutions (such as EDR, NDR, and vulnerability management), utility products (such as remote control and web filtering), and managed detection and response (MDR) services. Through an extensive network of points of sale, resellers, and partners, Version 2 offers highly acclaimed products along with customized, localized professional services.

Version 2 (Taiwan) serves Taiwan, Hong Kong, mainland China, Singapore, Macau, and other markets, with customers across industries, including Global 1000 multinationals, listed companies, public utilities, government agencies, countless successful SMEs, and consumer-market customers in cities across Asia.

OpenLogic: How to Find the Best Linux Distro for Your Organization

"What's the best Linux distro?"

A better question to ask: "Which Linux distro can meet my business's needs now and as we scale?"

With CentOS Linux having reached end of life, the playing field has changed and several viable alternatives have emerged. This article gives an overview of the post-CentOS Linux landscape, compares today's most popular enterprise Linux distributions, and highlights their key differentiators. As you read, keep in mind your team's bandwidth and expertise for managing Linux infrastructure, and weigh factors such as cost, stability, and security together.

For users migrating off CentOS, a distribution's longevity matters especially. Confidence in the project's direction, the strength of its community, and its governance model (for example, how much control a for-profit corporation holds) are all key considerations when choosing the open source Linux distro that fits your organization.

Types of Open Source Linux Distributions

A Linux distribution combines the open source Linux kernel with a suite of supporting software that helps users develop and run applications. Open source communities decide which packages to include based on the use cases they prioritize. A desktop-oriented distribution, for example, might include media players and interface customization tools, while enterprise distributions focus more on security, stability, and performance to meet the needs of mission-critical applications.

Open source Linux distributions can be categorized along several lines, such as who maintains them (a community or a commercial company), their release model (rolling or fixed), and their upstream source (for example, Fedora or Debian).

Community vs. Commercial

Community-supported Linux distributions are free to use and maintained by communities of individual contributors. These volunteers dedicate their time and expertise to keeping the project updated, including security patches, bug fixes, and new releases.

Commercial enterprise Linux distributions, by contrast, are provided by software vendors, built on open source components, and require a paid subscription. While the core functionality matches the community versions, subscribers get technical support and often proprietary enterprise features or tools.

Rolling Release vs. Fixed Release

In a rolling release model, new features and updates are delivered continuously and incrementally rather than being bundled into new versions on a fixed schedule. Packages such as the Linux kernel, libraries, and tools are released as soon as they are ready, without waiting for a specific date. This steady stream of updates means rolling release users never need to perform a large version upgrade, and issues and vulnerabilities can be found and fixed faster than in fixed-release distributions.

Rolling releases appeal to users who want the latest software and features, but stability can suffer. They require users to actively manage their systems and be ready to deal with any problems an update may introduce. Because changes may not be fully tested, a rolling model can lead to conflicting software versions, and even subtle changes in new features can affect how applications behave. For that reason, many enterprises prefer a fixed release model for critical workloads to ensure stability.

Upstream Source

Distributions can originate from different ecosystems such as Fedora, RHEL (itself based on Fedora), Debian, or SUSE. Each ecosystem has its strengths, and the choice may come down to your team's familiarity and other factors (for example, Oracle Linux may be more appealing if you are already an Oracle customer).

Next, we'll take a closer look at some of these distributions, grouped by upstream source, starting with Fedora.

Note: Distributions marked with an asterisk (*) are currently supported by OpenLogic.

Fedora and RHEL-Based Distributions

Fedora*

Fedora is a well-known community-supported distribution famous for cutting-edge technology and open source collaboration. It provides a platform for both desktop and server users that balances the latest software with stability. Fedora users enjoy exploring the leading edge, contributing to open source projects, and trying innovative features. A new release typically ships each spring and fall.

CentOS Stream*

CentOS Stream is considered a "rolling preview" of RHEL and serves as the bridge between Fedora and RHEL, using the same source code Red Hat uses for the next generation of RHEL. The current version is CentOS Stream 10, which runs ahead of RHEL 10 (and downstream distributions such as Rocky Linux and AlmaLinux). Whether to choose CentOS Stream depends on your overall preferences within the Linux ecosystem. The package management and virtualization options of the RHEL / CentOS ecosystem still apply in Stream, and bug fixes and security patches arrive faster than they did in the now-discontinued CentOS Linux. If you are undecided about the rolling release model, the CentOS Stream migration guide is a good reference.

Red Hat Enterprise Linux (RHEL)

RHEL is a widely recognized commercial enterprise distribution known for its stability, long-term support, and complete ecosystem. It is offered in multiple editions targeting servers, cloud, containers, and other scenarios. RHEL is built from a snapshot of CentOS Stream, with software versions frozen and only security updates applied on top of that base, ensuring stability and security. Red Hat (now part of IBM) provides professional support for RHEL, but its licensing and annual fees may be steep for some organizations. As with other commercial software, vendor lock-in is also a risk to consider.

CentOS Linux (Discontinued)*

To the surprise and disappointment of the community, CentOS 8 was cut short in 2021, only two years after its release, and CentOS 7 reached end of support in 2024. Red Hat, which controlled the project at the time, announced the end of CentOS Linux in order to focus on CentOS Stream. The move gave rise to new distributions built from RHEL source code, such as Rocky Linux and AlmaLinux, to fill the gap. Migrating and retiring environments can take months or even years, and CentOS long-term support can buy organizations the time to evaluate alternative distributions and move their EOL deployments.

Rocky Linux*

Rocky Linux, founded by one of the original creators of CentOS, is a popular community-supported CentOS replacement. It promises bug-for-bug compatibility with RHEL and aims to give enterprises and users who relied on CentOS a stable, dependable server platform.

AlmaLinux*

Like Rocky Linux, AlmaLinux is a community-supported distribution launched in response to the discontinuation of CentOS Linux. It is binary compatible with RHEL, so applications run on it seamlessly.

Oracle Linux*

Oracle Linux, packaged and distributed by Oracle, is another rebuild that is binary compatible with RHEL. It is tuned to work with Oracle's other software, making it a good fit for running workloads such as Oracle Database. Although some worry that Oracle might start charging for it the way it did with Oracle JDK in 2019, it remains free for now, and commercial support with SLAs is available at prices comparable to RHEL.

Debian-Based Distributions

Debian Linux*

Debian is known for its commitment to open source principles, its stability, and its rich package management, and it serves as the foundation for distributions such as Ubuntu and Linux Mint. Widely used in both desktop and server environments, it suits users looking for a reliable, customizable distribution across a range of scenarios, including embedded systems.

Debian Testing

Debian offers a testing branch that sits between unstable and stable, suited to users who want newer software while retaining reasonable stability. The testing branch receives new features and fixes earlier than stable, but users may have to resolve issues themselves in exchange for the latest features.

Ubuntu (Community Edition)*

Ubuntu is known for its user-friendliness, strong ecosystem, and active community, and is widely used on desktops, on servers, and in enterprise settings. Like Debian, it uses apt for package management, and it ships with a number of AI-related packages.

Ubuntu Pro

Ubuntu Pro is the commercial edition of Ubuntu, known for its ease of use, regular updates, and cloud compatibility. It offers enhanced editions for desktop, server, IoT, and cloud, and it appeals to developers with its wealth of programming resources and AI libraries.

Linux Mint

Linux Mint aims to give both newcomers and veterans a stable, friendly experience. Built on Ubuntu and Debian, it adds extra features and design touches with an emphasis on convenience, offering a traditional desktop along with extensive customization options, and it is particularly well suited to users moving from Windows.

SUSE-Based Distributions

OpenSUSE Leap*

OpenSUSE Leap is community-driven and combines fixed-release stability with up-to-date packages, making it suitable for both desktops and servers. It is stable and ready to use out of the box, feels especially familiar to users of SLES and the broader SUSE ecosystem, and emphasizes simple deployment and cloud readiness.

OpenSUSE Tumbleweed*

Tumbleweed is the rolling release edition of OpenSUSE. Bug fixes and security patches arrive earlier, although some features may not yet be fully mature. It supports a wide variety of desktop environments and tools.

SUSE Linux Enterprise Server (SLES)

SLES is the commercial counterpart of OpenSUSE, backed by the German company SUSE. It focuses on reliability and high performance, supports systemd, Btrfs, and containers, and is well suited to server and virtualization workloads.

Other Open Source Distributions

Arch Linux

Arch Linux is lightweight and highly customizable, follows a rolling release model, and emphasizes simplicity and a do-it-yourself philosophy. It suits experienced users who want to build a system tailored to their needs and is popular with developers and enthusiasts.

Alpine Linux

Alpine Linux is small and security-oriented, designed for containerization and resource efficiency. It is well suited to containers, IoT, and embedded systems, with an emphasis on fast boot times and a minimal footprint.

Amazon Linux

Amazon Linux is built for AWS and intended for EC2, where it is offered as a default AMI. It is now based on CentOS Stream, and its source code is open.

Final Thoughts

Choosing the Linux distribution that best fits your organization takes time and research. It is critical to weigh each distribution's business value and implementation challenges, including use cases, skill requirements, and learning curves. Package management, ecosystem compatibility, and lock-in risk also need to be evaluated.

If you want to avoid lock-in while still having support in place, consider partnering with OpenLogic. We provide SLA-backed enterprise Linux support, every customer works directly with architects who have 15+ years of experience, and we offer migration services from consultation through execution.


Running Kafka Without ZooKeeper in KRaft Mode

ZooKeeper will be completely gone in the next major Apache Kafka release (Kafka 4), and replaced by Kafka Raft, or KRaft mode. Many developers are excited about this change, but it will impact teams currently running Kafka with ZooKeeper who need to determine an upgrade path once ZooKeeper is no longer supported.

In this blog, our expert explains what KRaft mode is and how Raft implementations differ from ZooKeeper-based deployments, what to consider when planning your KRaft migration, and how your environment will look different when you’re running Kafka without ZooKeeper.

Note: This blog was originally published in 2022 and was updated and revised in 2025 to reflect the latest developments.

 

What Is Kafka Raft (KRaft) Mode?

Kafka Raft (which loosely stands for Reliable, Replicated, Redundant, And Fault Tolerant) or KRaft, is Kafka’s implementation of the Raft consensus algorithm.

Created as an alternative to the Paxos family of algorithms, the Raft consensus protocol is meant to be a simpler consensus implementation with the goal of being easier to understand than Paxos. Both Paxos and Raft operate in a similar manner under normal, stable operating conditions, and both protocols accomplish the following:

  • The leader writes the operation to its log and requests that the following servers do the same
  • The operation is marked as “committed” once a majority of servers acknowledge the operation

This results in a consensus-based change to the state machine, or in this specific case, the Kafka cluster.

The main difference between Raft and Paxos, however, arises when operations are not normal and a new leader must be elected. Both algorithms guarantee that the new leader's log will contain the most up-to-date commits, but how they accomplish this differs.

In Paxos, the leader election includes not only the proposal and subsequent vote, but also any log entries the candidate is missing. Followers in Paxos implementations can vote for any candidate, and once a candidate is elected as leader, the new leader uses this data to bring its log up to date.

In Raft, on the other hand, followers will only vote for a candidate if the candidate's log is at least as up to date as the follower's log. This means only the most up-to-date candidate will be elected as leader. Ultimately, both protocols are remarkably similar in their approach to solving the consensus problem. However, because Raft makes some base assumptions about the data, namely the order of commits in the log, elections in Raft are more efficient.

What does this mean in regards to Kafka? From the protocol side of things, not much. ZooKeeper uses its own purpose-built consensus protocol called ZAB (ZooKeeper Atomic Broadcast) that is strongly focused on total ordering of commits to the change log. That focus on commit order makes Raft consensus fit quite well into the Kafka ecosystem.

That said, changes from an infrastructure perspective will be quite noticeable. With the KRaft logic incorporated into each broker's base code, ZooKeeper nodes will no longer be part of the Kafka infrastructure. Note that this doesn't necessarily mean fewer servers in the production environment (more on that later).

 

Why Is Kafka Raft Replacing ZooKeeper?

To understand why the Kafka community leadership decided to move away from ZooKeeper, we can look directly at KIP-500 for their reasoning. In short, the move was meant to reduce complexity and handle cluster metadata in a more robust fashion. Removing the requirement for ZooKeeper means there is no longer a need to deploy two distinctly different distributed systems. ZooKeeper has different deployment patterns, management tools, and configuration syntax than Kafka. Unifying the functionality into a single system reduces configuration errors and overall operational complexity.

In addition to simpler operations, treating the metadata as its own event stream means that a single number, an offset, can be used to describe a cluster member’s position and be quickly brought up to date. This in effect applies the same principles used for producers and consumers to the Kafka cluster members themselves.
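If you want to see this offset-based view of the metadata log in practice, recent Kafka releases ship a kafka-metadata-quorum.sh tool. The commands below are a minimal, read-only sketch, assuming a broker reachable at localhost:9092; check your version's documentation for the exact flags:

$ bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
$ bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication

The output includes the high watermark of the metadata log and, per node, the last fetched offset and lag, which is exactly the "single number" position described above.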


 

KRaft vs. ZooKeeper

In a ZooKeeper-based Kafka deployment, the cluster consists of several broker nodes and a quorum of ZooKeeper nodes. In this environment, each change to the cluster metadata is treated as an isolated event, with no relationship to previous or future events. When state changes are pushed out to the cluster from the cluster controller, a.k.a. the broker in charge of tracking/electing partition leadership, there is potential for some brokers to not receive all updates, or for stale updates to create race conditions as we’ve seen in some larger Kafka installations. Ultimately, these failure points have the potential to leave brokers in divergent states.

While not entirely accurate, as all broker nodes can (and do) talk to ZooKeeper, the image below is a basic example of what that looks like:


In contrast, the metadata in KRaft is stored within Kafka itself, and ZooKeeper is effectively replaced by a quorum of Kafka controllers. The controller nodes comprise a Raft quorum that elects the active controller, which manages the metadata partition. This log contains everything that used to be found in ZooKeeper: topics, partitions, ISRs, configuration data, and so on will all be located in this metadata partition.

Using the Raft algorithm, the controller nodes elect a leader without the use of an external system like ZooKeeper. The leader, or active controller, is the partition leader for the metadata partition and handles all RPCs from the brokers.

Learn more about Kafka partitions >>

The diagram below is a logical representation of the new cluster environment implementation using KRaft:


Note that in the diagram above there is no longer a double-sided arrow. This denotes another major difference between the two environments: instead of the controller pushing updates out to the brokers, the brokers fetch the metadata from the active controller via a MetadataFetch API. In similar fashion to a regular fetch request, each broker tracks the offset of the latest update it fetched, requests only newer updates from the active controller, and persists that metadata to disk for faster startup times.

In most cases, the broker will only need to request the deltas of the metadata log. However, in cases where no data exists on a broker or a broker is too far out of sync, a full metadata set can be shipped. A broker will periodically request metadata and this request will act as a heartbeat as well.

Previously, when a broker entered or exited a cluster, this was kept track of in ZooKeeper, but now the broker status will be registered directly with the active controller. In a post-ZooKeeper world, cluster membership and metadata updates are tightly coupled. Failure to receive metadata updates will result in eviction from the cluster.

ZooKeeper Deprecation and Removal

KRaft has been considered “production ready” since Kafka 3.3 and ZooKeeper was officially deprecated in Kafka 3.5. It will be removed completely in Kafka 4 and higher.

 

Running Kafka Without ZooKeeper

As organizations plan their migrations to KRaft environments, there are quite a few things to consider. In a KRaft-based cluster, each Kafka node runs in one of three modes, known as its process.roles setting, which can be set to broker, controller, or combined. In a production cluster, the combined mode should be avoided; in other words, brokers and controllers should each have dedicated JVM resources. So, as mentioned previously, doing away with ZooKeeper doesn't necessarily mean doing away with compute resources in production. Using the combined mode in development or staging environments is perfectly acceptable.
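As a rough illustration of what dedicated roles look like in configuration, below is a minimal sketch of KRaft-style server.properties fragments and the one-time storage formatting step. Host names, node IDs, ports, and paths are placeholders, and settings such as advertised listeners and security protocol mappings are omitted for brevity, so adapt this to your environment rather than copying it verbatim:

# dedicated controller (one of three nodes in the quorum)
process.roles=controller
node.id=1
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
listeners=CONTROLLER://controller1:9093
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/metadata

# dedicated broker
process.roles=broker
node.id=101
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
listeners=PLAINTEXT://broker101:9092
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/data

# every node is formatted with the same cluster ID before its first start
$ KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
$ bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties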

Since we originally published this blog, several upgrades and changes to the KRaft implementation have been completed and released. The list of caveats previously mentioned has largely been addressed, including:

  • Configuring SCRAM users via the administrative API: With the completion and implementation of KIP-900 in Kafka 3.5.0 for inter-broker communications, the kafka-storage tool can be used as a mechanism to configure SCRAM (see the sketch after this list).
  • Supporting JBOD configurations with multiple storage directories: JBOD support was introduced in 3.7 as an early access feature. With the completion of KIP-858 and its implementation in 3.8, that caveat no longer applies.
  • Modifying certain dynamic configurations on the standalone KRaft controller: In early releases of Kafka KRaft, some configuration items could not be updated dynamically, but as far as we are aware, these have mostly been fixed.  The “missing features” verbiage should be removed with 4.0 (see conversation here).
  • Delegation tokens: KIP-900 also paved the way for “delegation token” support.  With the completion of KAFKA-15219 in 3.6, delegation tokens are now supported in KRaft mode.
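Returning to the SCRAM item above, KIP-900 allows credentials to be seeded while the storage directories are being formatted, since there is no running cluster yet to accept an admin API call. A minimal sketch, assuming a hypothetical admin user (the name and password are placeholders; check your Kafka version's documentation for the exact --add-scram syntax):

$ bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties \
    --add-scram 'SCRAM-SHA-256=[name=admin,password=admin-secret]'

Once the cluster is running, additional SCRAM users can be managed through the usual kafka-configs.sh workflow.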

 

KRaft Migration

Although a fully fledged and supported upgrade path has been implemented and can be used to move clusters from ZooKeeper mode to KRaft mode, in-place upgrades generally should be avoided. At OpenLogic, we typically recommend lift-and-shift style, blue-green deployments instead. However, given the complexity of some Kafka clusters, having an official migration path is very much a welcome tool in the tool belt.

While detailing the KRaft migration process would require an entire series of blog posts, you can find an overview of the process in the Kafka documentation here. Of particular interest, though, is the requirement to upgrade to Kafka 3.9.0. The metadata version cannot be upgraded during the migration, so inter.broker.protocol.version must be at 3.9 prior to the migration. So, even if your organization isn’t planning on migrating to KRaft anytime soon, it still makes sense to plan your upgrade to 3.9 sooner rather than later.
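In practice, that requirement means pinning the protocol version in each broker's server.properties and completing a rolling restart before the migration begins. A minimal sketch (the file path is a placeholder):

inter.broker.protocol.version=3.9

$ grep inter.broker.protocol.version /etc/kafka/server.properties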

 

KRaft Mode FAQ

What benefits would my organization see, if any, from migrating to KRaft?

The most basic benefit for any organization is, of course, being able to stay up to date on your software versions. With ZooKeeper removal on the horizon, staying updated in ZK mode will eventually be impossible. Also, organizations will see a decrease in cluster complexity, as Kafka will handle metadata natively instead of utilizing third-party software.

Lastly, organizations operating at the upper levels of cluster size will see a considerable increase in reliability and service continuity in KRaft mode. While ZooKeeper is a reliable coordination service for a myriad of open source projects, whenever our customers with extremely large clusters (i.e. 30/40+ brokers with thousands of partitions) are encountering severe issues, it often winds up being a ZooKeeper issue.

 

If we migrate from ZooKeeper to KRaft, can we decommission our dedicated ZK hardware?

Most likely, no, at least not in production. Production KRaft controllers should be deployed in dedicated controller mode, so they will need dedicated compute just like ZooKeeper in production does. However, non-production clusters can run in mixed mode.

 

We have “N” number of ZooKeeper nodes; how many KRaft controller nodes should we use?

At the very minimum organizations should deploy at least 3 controller nodes in production. The system requirements for ZooKeeper and controller nodes are quite similar, though, so deploying the same number of controller nodes is a reasonable place to start. Ultimately, a thorough load and performance testing protocol should be followed to validate this.

 

If we are running Kafka via Strimzi Kubernetes operator, can we start using KRaft?

Yes! However, be aware that as of version 0.45.0, there are some limitations with controllers. Currently, Strimzi continues to use static controller quorums, which means there are limitations on using dynamic controller quorums. More information can be found in the Strimzi documentation here.

 

Final Thoughts

For greenfield implementations, using KRaft should be a no-brainer, but for mature Kafka environments, migrating will be a complete rip and replace for your cluster, with all the complications that could follow. Creating a detailed migration plan, with blue/green deployment strategies, is crucial in such cases. And if your team is lacking in Kafka expertise, seeking out external support to guide your migration would also be a good idea.


About Perforce
The best run DevOps teams in the world choose Perforce. Perforce products are purpose-built to develop, build and maintain high-stakes applications. Companies can finally manage complexity, achieve speed without compromise, improve security and compliance, and run their DevOps toolchains with full integrity. With a global footprint spanning more than 80 countries and including over 75% of the Fortune 100, Perforce is trusted by the world’s leading brands to deliver solutions to even the toughest challenges. Accelerate technology delivery, with no shortcuts.

About Version 2 Digital

Version 2 Digital is one of the most dynamic IT companies in Asia. The company distributes a wide range of IT products across various areas including cyber security, cloud, data protection, end points, infrastructures, system monitoring, storage, networking, business productivity and communication products.

Through an extensive network of channels, point of sales, resellers, and partnership companies, Version 2 offers quality products and services which are highly acclaimed in the market. Its customers cover a wide spectrum which include Global 1000 enterprises, regional listed companies, different vertical industries, public utilities, Government, a vast number of successful SMEs, and consumers in various Asian cities.

Planning a CentOS to Rocky Linux Migration

On June 30, 2024, the ten-year run of CentOS 7 came to an end, but more significantly, the date marked all versions of CentOS Linux reaching end of life. Now, more than six months later, many organizations are still planning their migrations to other community-supported Enterprise Linux distributions. They are exploring CentOS alternatives in the hope of avoiding the squeeze from Red Hat and having to pay for Extended Life Cycle Support, on top of RHEL licenses for each host that needs to be kept secure (which is usually all of them).

One popular migration path is CentOS to Rocky Linux. In other blogs, we’ve looked at why many have chosen Rocky Linux, but here, we’ll focus on the primary considerations for organizations that haven’t made the move yet, highlight potential risks, and explain how to evaluate the readiness of a given enterprise architecture to move from one major Enterprise Linux version (6, 7, and 8) to the latest version, Enterprise Linux 9. We’ll also provide a step-by-step walkthrough of a CentOS 7 to Rocky Linux 9 migration.

 

Step 1: Evaluate Whether Your Applications Are Ready for Rocky Linux Migration

How were you made aware that your team has out-of-compliance, End-of-Life CentOS hosts in their architecture in the first place? Perhaps your IT department sent you an email containing a list of these hosts, as the result of a scan. Or maybe you examined a list of AMIs in your Amazon Web Services (AWS) EC2 infrastructure, and came up with a list of affected hosts. Whatever way it happened, you have the list of hosts that need an upgrade, which means you have essentially completed the first step for migration: creating an inventory of hosts that need to make the move to Rocky Linux.

Ideally, you have a list of applications that are running on each of these hosts, and know the purpose of each. For example, “This one is a MariaDB server,” or “This runs Oracle 12c.” If you don’t, now is a good time to start building out a spreadsheet listing the CentOS hosts that need to be upgraded that includes the workload each is responsible for. Then create another inventory, for each host, of what software is installed. Find the owners of each machine, so that you can further examine what software is installed, perhaps in unconventional ways, that you might be missing.
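A quick way to seed that spreadsheet is to pull the OS release and package list from each affected host. The loop below is only a sketch; the host names are placeholders and it assumes SSH access is already in place:

$ for host in web01 db01 app02; do ssh "$host" 'hostname; cat /etc/centos-release; rpm -qa | sort' > "inventory-$host.txt"; done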

Without knowing what software is on which machines, you won't know the side effects of completing a major Enterprise Linux version upgrade. The main side effects of upgrading are new versions of the kernel, new versions of software in the yum / dnf repositories, and new versions of glibc and libstdc++. And these side effects can have some major unintended consequences.

Step 2: Decide Between Migrating in Place vs. Building a New System

The major consequences listed above are particularly important when deciding whether to do an in-place upgrade or to build a new system. You can either migrate a system to Rocky Linux in place, or build a new system and migrate applications and data over. Each option has pros and cons, but let's examine the three side effects of upgrading: a new kernel, new versions of dnf-sourced applications, and new API and ABI versions.

To be clear, it’s not just the ABI contracts of glibc and libstdc++ that can stymie your upgrade plans. All bets are off regarding API and ABI compatibility between all libraries and packages between major versions of EL. Another risk of in-place upgrades is, in the end, the system may have a combination of packages from different EL versions; a few libraries from EL7 and an app from EL8 running on an EL9 box. Hybrid-version systems are incredibly difficult to troubleshoot, or even rebuild if not restoring from a full backup.

When you examine your inventory of hosts that need to be migrated off of CentOS, you need to determine whether they're bare metal. Physical hosts are much more likely to have custom kernel modules or drivers built from source against the current kernel, source that's often proprietary. Perhaps the host is connected to a tape library, has a PCI-E card like a graphics card, or has another peripheral that's connected to an industrial application from a third-party manufacturer. For this reason, a hardware inventory is incredibly important: what peripherals or non-standard hardware are installed in the host?
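The commands below are one low-effort way to gather that hardware and kernel-module inventory on an existing CentOS host; the extra and weak-updates directories are where out-of-tree modules typically land on Enterprise Linux systems:

$ lspci -nn                                    # PCI devices: GPUs, HBAs, capture cards, etc.
$ lsmod                                        # kernel modules currently loaded
$ find /lib/modules/$(uname -r)/extra /lib/modules/$(uname -r)/weak-updates -name '*.ko*' 2>/dev/null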

Most open source drivers and kernel modules are going to seamlessly work from one major EL release to another, but add-ons that aren’t shipped with the server are more likely to need a driver to be recompiled against the new kernel in Rocky Linux. Make sure you can both obtain the source code for the driver and that it compiles against the new kernel. Otherwise you might successfully upgrade to Rocky Linux, but be stuck with an application that can’t reach a critical peripheral.

If you are on physical hardware as described above, migrating in place has the advantage that you don’t need more systems. This may be the easiest (and least expensive) route. You do have to make sure that connections are stable, and the machine will be available the whole time, because if the upgrade script is interrupted, the system may be left in an unrecoverable state. This would probably require a rescue disk and some manual work to get to a point where it is usable again. If you can’t take a system out of production to rebuild, then running the migration may be the best option, as the migration can be planned in a standard maintenance window.

However, if you are on a virtualized infrastructure, or have spare hardware, it would be safer to build a new system as you want it, then migrate the data and applications over and swap out the old system with the new. But even then, a software inventory is especially important due to the upgraded dnf-sourced software and upgraded glibc/libstdc++ libraries.

Step 3: Use a Software Inventory to Mitigate Risk

If your organization produces software with C++, it’s possible that you’re producing applications “dynamically linked” against system libraries. If you’re targeting CentOS 7, and you upgrade to Rocky Linux 9, the libraries you linked against are going to be upgraded, and the application you wrote might suddenly crash at runtime, even if it starts successfully after the upgrade.

This is because certain standard library methods might remain the same, allowing the application to start, but others might have been deprecated or changed, causing the application to crash when they're no longer available. Because these compatibility checks only happen at compile time, the failure surfaces as a runtime error.

Even if your organization doesn’t maintain their own C++ applications, you might be installing applications from a third-party vendor that link against CentOS 7 libraries. If this vendor uses an external yum or dnf-based repository, there’s a good chance that the upgrade to Rocky might fail due to unresolved dependency issues. If the application is installed by downloading a .tar.gz, .sh, or .run file, and binaries are installed onto the host from that installer, there’s a strong possibility that the application might suffer similar crashes or incompatibilities from unexpected versions of C++ libraries, Python bindings, Lua versions, and the list goes on.
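One low-effort sanity check, before or after a test upgrade, is to point ldd at the vendor binaries and look for libraries the new release no longer provides. The binary path below is a hypothetical example:

$ ldd /opt/vendor-app/bin/server | grep 'not found'
$ ldd /opt/vendor-app/bin/server | awk '/=> \//{print $3}' | xargs rpm -qf | sort -u   # which packages provide the linked libraries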

All of the above illustrates why it’s so important to have a software inventory of some kind. It could be as simple as a spreadsheet or a Software Bill of Materials (SBOM). Once you have that inventory, you can plan ahead and contact the vendors of third-party software, making sure they have an EL 8 or EL9 version of their software that can be upgraded to once your host becomes a Rocky Linux host.

As for the dnf-sourced packages, and considering the previously mentioned issues with version numbers changing for supporting libraries, moving from CentOS 7 to Rocky Linux 9 can include some major upgrades. For example, the upgrade from MariaDB 5.5, which was released in April 2012, to MariaDB 10.5.27, which shipped in May 2023. As you can imagine, there needs to be an end-to-end plan for this upgrade, and all of the hard-to-predict “ripples” it may cause.
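Part of that plan should be a logical backup you can restore into the new MariaDB major version if the on-disk data files cannot be upgraded cleanly. A minimal sketch, with a placeholder backup path and assuming mostly InnoDB tables:

$ mysqldump --all-databases --routines --events --single-transaction > /backup/all-databases-$(date +%F).sql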

What happens when you run into applications that can't run on the new version of Rocky 8 or Rocky 9? One option would be containerizing old workloads in order to increase security, reducing the attack surface by running them in an isolated container on an up-to-date Rocky 9 host.
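As a rough sketch of that approach (the volume path and start script are placeholders, and the centos:7 base image no longer receives updates, so treat this as a stopgap rather than a destination):

$ podman run --rm -it --name legacy-app -v /opt/legacy:/opt/legacy:Z centos:7 /opt/legacy/bin/start.sh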


Step 4: Determine Your CentOS to Rocky Linux Migration Path

There are a few considerations teams will need to make when migrating from CentOS to Rocky Linux. Depending on the CentOS version(s) being used, the upgrade path may require migrating to intermediary versions before arriving at the intended version. (E.g., CentOS 6 to CentOS 7, CentOS 7 to CentOS 8, CentOS 8 to Rocky Linux 8, then Rocky Linux 8 to Rocky Linux 9).

Approaching your upgrade with this step-wise approach is also useful for mitigating the risks described above. Catching incompatibilities early on in the upgrade process will be critical for informing leadership of how long this process is actually going to take.

Migrating CentOS 6 to Rocky Linux

Unfortunately, there is no direct migration path from CentOS 6 to Rocky Linux. Rocky Linux starts at 8, so hosts will have to be on CentOS 8 to migrate. As described above, there are too many changes between CentOS 6 and 7, much less 8, to migrate. Once you’ve migrated to CentOS 7 you can migrate from CentOS 7 to CentOS 8 then migrate to Rocky Linux 9, or migrate from CentOS 8 to Rocky Linux 8 then upgrade to Rocky Linux 9.

For CentOS 6 users there are two ways to do this.

  1. Upgrade from CentOS 6, to 7, to 8, then migrate to Rocky Linux 8, then upgrade to Rocky Linux 9. This process can take a considerable amount of time, and can run into some additional hiccups due to the package changes between major versions.
  2. Build a new machine and migrate your data over. The best case scenario with this approach is that all third party software needed in the new machine has a new version, and all your data can be upgraded safely. The worst case is that you’ll have software with no new version and will need to find alternatives or find a way to run the software anyway. This depends on the libraries being used in your CentOS 6 install. Luckily, containerization makes it easy to run older versions of software on newer systems, or even on completely different distributions entirely.

Migrating CentOS 7 to Rocky Linux

The migration path for CentOS 7 to Rocky Linux is similar to CentOS 6. However, migrating from CentOS 7 to Rocky Linux is a bit easier because CentOS 7 already uses systemd for service management, while CentOS 6 still uses legacy SysV init scripts.

There are a few other changes to keep an eye out for when moving from CentOS 7 to Rocky Linux, but, aside from the service management difference, the considerations are nearly identical to CentOS 6 migration.

Video: CentOS 7 Migration Recommendations

Migrating CentOS 8 to Rocky Linux

Migrating from CentOS 8 to Rocky Linux 8 is relatively painless, and avoids all of the risks described above. Since it is nearly identical, there are only a few changes, which makes this the least risky and easiest to validate step in the process. The repos are swapped out for Rocky Linux repos, and a few packages are replaced (mostly branding packages, for example, the package that provides all of the logos for CentOS).

Real-World Example: Upgrading CentOS 7 to Rocky Linux 9

In this section, we will walk through a CentOS 7 to Rocky Linux 9 migration, showing all the steps involved and potential trouble spots.

1. Install the Latest Version of the ELevate Repository from the AlmaLinux Project

$ yum install -y http://repo.almalinux.org/elevate/elevate-release-latest-el$(rpm --eval %rhel).noarch.rpm

Loaded plugins: auto-update-debuginfo, fastestmirror

elevate-release-latest-el7.noarch.rpm                    | 6.6 kB     00:00

Examining /var/tmp/yum-root-nXSITp/elevate-release-latest-el7.noarch.rpm: elevate-release-1.0-2.el7.noarch

Marking /var/tmp/yum-root-nXSITp/elevate-release-latest-el7.noarch.rpm to be installed

Resolving Dependencies

–> Running transaction check

—> Package elevate-release.noarch 0:1.0-2.el7 will be installed

–> Finished Dependency Resolution

Dependencies Resolved

================================================================================

 Package          Arch    Version     Repository                           Size

================================================================================

Installing:

 elevate-release  noarch  1.0-2.el7   /elevate-release-latest-el7.noarch  3.4 k

Transaction Summary

================================================================================

Install  1 Package

Total size: 3.4 k

Installed size: 3.4 k

Downloading packages:

Running transaction check

Running transaction test

Transaction test succeeded

Running transaction

  Installing : elevate-release-1.0-2.el7.noarch                             1/1

  Verifying  : elevate-release-1.0-2.el7.noarch                             1/1

Installed:

  elevate-release.noarch 0:1.0-2.el7

Complete!

2. Install the LEAPP Package

Specifically, we are going to install the leapp-data-rocky package, which will help us move to Rocky Linux, as opposed to AlmaLinux:

$ yum install -y leapp-upgrade leapp-data-rocky

Loaded plugins: auto-update-debuginfo, fastestmirror

Determining fastest mirrors

Resolving Dependencies

–> Running transaction check

—> Package leapp-data-rocky.noarch 0:0.5-1.el7.20241127 will be installed

—> Package leapp-upgrade-el7toel8.noarch 1:0.21.0-4.el7.elevate.4 will be installed

–> Processing Dependency: leapp-repository-dependencies = 10 for package: 1:leapp-upgrade-el7toel8-0.21.0-4.el7.elevate.4.noarch

–> Processing Dependency: leapp-framework >= 6.0 for package: 1:leapp-upgrade-el7toel8-0.21.0-4.el7.elevate.4.noarch

–> Processing Dependency: leapp for package: 1:leapp-upgrade-el7toel8-0.21.0-4.el7.elevate.4.noarch

–> Processing Dependency: python2-leapp for package: 1:leapp-upgrade-el7toel8-0.21.0-4.el7.elevate.4.noarch

–> Running transaction check

—> Package leapp.noarch 0:0.18.0-2.el7 will be installed

—> Package leapp-upgrade-el7toel8-deps.noarch 1:0.21.0-4.el7.elevate.4 will be installed

–> Processing Dependency: dnf >= 4 for package: 1:leapp-upgrade-el7toel8-deps-0.21.0-4.el7.elevate.4.noarch

–> Processing Dependency: pciutils for package: 1:leapp-upgrade-el7toel8-deps-0.21.0-4.el7.elevate.4.noarch

—> Package python2-leapp.noarch 0:0.18.0-2.el7 will be installed

–> Processing Dependency: leapp-framework-dependencies = 6 for package: python2-leapp-0.18.0-2.el7.noarch

–> Running transaction check

—> Package dnf.noarch 0:4.0.9.2-2.el7_9 will be installed

–> Processing Dependency: python2-dnf = 4.0.9.2-2.el7_9 for package: dnf-4.0.9.2-2.el7_9.noarch

—> Package leapp-deps.noarch 0:0.18.0-2.el7 will be installed

–> Processing Dependency: PyYAML for package: leapp-deps-0.18.0-2.el7.noarch

—> Package pciutils.x86_64 0:3.5.1-3.el7 will be installed

–> Running transaction check

—> Package PyYAML.x86_64 0:3.10-11.el7 will be installed

–> Processing Dependency: libyaml-0.so.2()(64bit) for package: PyYAML-3.10-11.el7.x86_64

—> Package python2-dnf.noarch 0:4.0.9.2-2.el7_9 will be installed

–> Processing Dependency: dnf-data = 4.0.9.2-2.el7_9 for package: python2-dnf-4.0.9.2-2.el7_9.noarch

–> Processing Dependency: python2-libdnf >= 0.22.5 for package: python2-dnf-4.0.9.2-2.el7_9.noarch

–> Processing Dependency: python2-libcomps >= 0.1.8 for package: python2-dnf-4.0.9.2-2.el7_9.noarch

–> Processing Dependency: python2-hawkey >= 0.22.5 for package: python2-dnf-4.0.9.2-2.el7_9.noarch

–> Processing Dependency: libmodulemd >= 1.4.0 for package: python2-dnf-4.0.9.2-2.el7_9.noarch

–> Processing Dependency: python2-libdnf for package: python2-dnf-4.0.9.2-2.el7_9.noarch

–> Processing Dependency: deltarpm for package: python2-dnf-4.0.9.2-2.el7_9.noarch

–> Running transaction check

—> Package deltarpm.x86_64 0:3.6-3.el7 will be installed

—> Package dnf-data.noarch 0:4.0.9.2-2.el7_9 will be installed

–> Processing Dependency: libreport-filesystem for package: dnf-data-4.0.9.2-2.el7_9.noarch

—> Package libmodulemd.x86_64 0:1.6.3-1.el7 will be installed

—> Package libyaml.x86_64 0:0.1.4-11.el7_0 will be installed

—> Package python2-hawkey.x86_64 0:0.22.5-2.el7_9 will be installed

–> Processing Dependency: libdnf(x86-64) = 0.22.5-2.el7_9 for package: python2-hawkey-0.22.5-2.el7_9.x86_64

–> Processing Dependency: libsolvext.so.0(SOLV_1.0)(64bit) for package: python2-hawkey-0.22.5-2.el7_9.x86_64

–> Processing Dependency: libsolv.so.0(SOLV_1.0)(64bit) for package: python2-hawkey-0.22.5-2.el7_9.x86_64

–> Processing Dependency: libsolvext.so.0()(64bit) for package: python2-hawkey-0.22.5-2.el7_9.x86_64

–> Processing Dependency: libsolv.so.0()(64bit) for package: python2-hawkey-0.22.5-2.el7_9.x86_64

–> Processing Dependency: librepo.so.0()(64bit) for package: python2-hawkey-0.22.5-2.el7_9.x86_64

–> Processing Dependency: libdnf.so.2()(64bit) for package: python2-hawkey-0.22.5-2.el7_9.x86_64

—> Package python2-libcomps.x86_64 0:0.1.8-14.el7 will be installed

–> Processing Dependency: libcomps(x86-64) = 0.1.8-14.el7 for package: python2-libcomps-0.1.8-14.el7.x86_64

–> Processing Dependency: libcomps.so.0.1.6()(64bit) for package: python2-libcomps-0.1.8-14.el7.x86_64

—> Package python2-libdnf.x86_64 0:0.22.5-2.el7_9 will be installed

–> Running transaction check

—> Package libcomps.x86_64 0:0.1.8-14.el7 will be installed

—> Package libdnf.x86_64 0:0.22.5-2.el7_9 will be installed

—> Package librepo.x86_64 0:1.8.1-8.el7_9 will be installed

—> Package libreport-filesystem.x86_64 0:2.1.11-53.el7.centos will be installed

—> Package libsolv.x86_64 0:0.6.34-4.el7 will be installed

–> Finished Dependency Resolution

Dependencies Resolved

================================================================================

 Package                Arch   Version                  Repository         Size

================================================================================

Installing:

 leapp-data-rocky       noarch 0.5-1.el7.20241127       elevate           323 k

 leapp-upgrade-el7toel8 noarch 1:0.21.0-4.el7.elevate.4 elevate           1.2 M

Installing for dependencies:

 PyYAML                 x86_64 3.10-11.el7              C7.9.2009-base    153 k

 deltarpm               x86_64 3.6-3.el7                C7.9.2009-base     82 k

 dnf                    noarch 4.0.9.2-2.el7_9          C7.9.2009-extras  357 k

 dnf-data               noarch 4.0.9.2-2.el7_9          C7.9.2009-extras   51 k

 leapp                  noarch 0.18.0-2.el7             elevate            31 k

 leapp-deps             noarch 0.18.0-2.el7             elevate            13 k

 leapp-upgrade-el7toel8-deps

                        noarch 1:0.21.0-4.el7.elevate.4 elevate            41 k

 libcomps               x86_64 0.1.8-14.el7             C7.9.2009-extras   75 k

 libdnf                 x86_64 0.22.5-2.el7_9           C7.9.2009-extras  535 k

 libmodulemd            x86_64 1.6.3-1.el7              C7.9.2009-extras  141 k

 librepo                x86_64 1.8.1-8.el7_9            C7.9.2009-updates  82 k

 libreport-filesystem   x86_64 2.1.11-53.el7.centos     C7.9.2009-base     41 k

 libsolv                x86_64 0.6.34-4.el7             C7.9.2009-base    329 k

 libyaml                x86_64 0.1.4-11.el7_0           C7.9.2009-base     55 k

 pciutils               x86_64 3.5.1-3.el7              C7.9.2009-base     93 k

 python2-dnf            noarch 4.0.9.2-2.el7_9          C7.9.2009-extras  414 k

 python2-hawkey         x86_64 0.22.5-2.el7_9           C7.9.2009-extras   71 k

 python2-leapp          noarch 0.18.0-2.el7             elevate           195 k

 python2-libcomps       x86_64 0.1.8-14.el7             C7.9.2009-extras   47 k

 python2-libdnf         x86_64 0.22.5-2.el7_9           C7.9.2009-extras  611 k

Transaction Summary

================================================================================

Install  2 Packages (+20 Dependent packages)

Total download size: 4.8 M

Installed size: 33 M

Downloading packages:

(1/22): deltarpm-3.6-3.el7.x86_64.rpm                      |  82 kB   00:00

warning: /var/cache/yum/x86_64/7/elevate/packages/leapp-0.18.0-2.el7.noarch.rpm: Header V4 RSA/SHA256 Signature, key ID 81b961a5: NOKEY

Public key for leapp-0.18.0-2.el7.noarch.rpm is not installed

(2/22): leapp-0.18.0-2.el7.noarch.rpm                      |  31 kB   00:00

(3/22): PyYAML-3.10-11.el7.x86_64.rpm                      | 153 kB   00:00

(4/22): leapp-deps-0.18.0-2.el7.noarch.rpm                 |  13 kB   00:00

(5/22): dnf-data-4.0.9.2-2.el7_9.noarch.rpm                |  51 kB   00:00

(6/22): dnf-4.0.9.2-2.el7_9.noarch.rpm                     | 357 kB   00:00

(7/22): leapp-upgrade-el7toel8-0.21.0-4.el7.elevate.4.noar | 1.2 MB   00:00

(8/22): leapp-upgrade-el7toel8-deps-0.21.0-4.el7.elevate.4 |  41 kB   00:00

(9/22): libcomps-0.1.8-14.el7.x86_64.rpm                   |  75 kB   00:00

(10/22): leapp-data-rocky-0.5-1.el7.20241127.noarch.rpm    | 323 kB   00:00

(11/22): libdnf-0.22.5-2.el7_9.x86_64.rpm                  | 535 kB   00:00

(12/22): libmodulemd-1.6.3-1.el7.x86_64.rpm                | 141 kB   00:00

(13/22): libreport-filesystem-2.1.11-53.el7.centos.x86_64. |  41 kB   00:00

(14/22): librepo-1.8.1-8.el7_9.x86_64.rpm                  |  82 kB   00:00

(15/22): libyaml-0.1.4-11.el7_0.x86_64.rpm                 |  55 kB   00:00

(16/22): pciutils-3.5.1-3.el7.x86_64.rpm                   |  93 kB   00:00

(17/22): python2-leapp-0.18.0-2.el7.noarch.rpm             | 195 kB   00:00

(18/22): libsolv-0.6.34-4.el7.x86_64.rpm                   | 329 kB   00:00

(19/22): python2-hawkey-0.22.5-2.el7_9.x86_64.rpm          |  71 kB   00:00

(20/22): python2-libcomps-0.1.8-14.el7.x86_64.rpm          |  47 kB   00:00

(21/22): python2-dnf-4.0.9.2-2.el7_9.noarch.rpm            | 414 kB   00:00

(22/22): python2-libdnf-0.22.5-2.el7_9.x86_64.rpm          | 611 kB   00:00

——————————————————————————–

Total                                              2.4 MB/s | 4.8 MB  00:01

Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-ELevate

Importing GPG key 0x81B961A5:

 Userid     : “ELevate <packager@almalinux.org>”

 Fingerprint: 74e7 f249 ee69 8a4d acfb 48c8 4297 85e1 81b9 61a5

 Package    : elevate-release-1.0-2.el7.noarch (@/elevate-release-latest-el7.noarch)

 From       : /etc/pki/rpm-gpg/RPM-GPG-KEY-ELevate

Running transaction check

Running transaction test

Transaction test succeeded

Running transaction

  Installing : libsolv-0.6.34-4.el7.x86_64                                 1/22

  Installing : librepo-1.8.1-8.el7_9.x86_64                                2/22

  Installing : libyaml-0.1.4-11.el7_0.x86_64                               3/22

  Installing : libmodulemd-1.6.3-1.el7.x86_64                              4/22

  Installing : libdnf-0.22.5-2.el7_9.x86_64                                5/22

  Installing : python2-libdnf-0.22.5-2.el7_9.x86_64                        6/22

  Installing : python2-hawkey-0.22.5-2.el7_9.x86_64                        7/22

  Installing : PyYAML-3.10-11.el7.x86_64                                   8/22

  Installing : leapp-deps-0.18.0-2.el7.noarch                              9/22

  Installing : python2-leapp-0.18.0-2.el7.noarch                          10/22

  Installing : pciutils-3.5.1-3.el7.x86_64                                11/22

  Installing : deltarpm-3.6-3.el7.x86_64                                  12/22

  Installing : libreport-filesystem-2.1.11-53.el7.centos.x86_64           13/22

  Installing : dnf-data-4.0.9.2-2.el7_9.noarch                            14/22

  Installing : libcomps-0.1.8-14.el7.x86_64                               15/22

  Installing : python2-libcomps-0.1.8-14.el7.x86_64                       16/22

  Installing : python2-dnf-4.0.9.2-2.el7_9.noarch                         17/22

  Installing : dnf-4.0.9.2-2.el7_9.noarch                                 18/22

  Installing : 1:leapp-upgrade-el7toel8-deps-0.21.0-4.el7.elevate.4.noa   19/22

  Installing : leapp-0.18.0-2.el7.noarch                                  20/22

  Installing : 1:leapp-upgrade-el7toel8-0.21.0-4.el7.elevate.4.noarch     21/22

  Installing : leapp-data-rocky-0.5-1.el7.20241127.noarch                 22/22

  Verifying  : dnf-4.0.9.2-2.el7_9.noarch                                  1/22

  Verifying  : leapp-0.18.0-2.el7.noarch                                   2/22

  Verifying  : libdnf-0.22.5-2.el7_9.x86_64                                3/22

  Verifying  : librepo-1.8.1-8.el7_9.x86_64                                4/22

  Verifying  : libmodulemd-1.6.3-1.el7.x86_64                              5/22

  Verifying  : dnf-data-4.0.9.2-2.el7_9.noarch                             6/22

  Verifying  : leapp-data-rocky-0.5-1.el7.20241127.noarch                  7/22

  Verifying  : libcomps-0.1.8-14.el7.x86_64                                8/22

  Verifying  : libreport-filesystem-2.1.11-53.el7.centos.x86_64            9/22

  Verifying  : python2-hawkey-0.22.5-2.el7_9.x86_64                       10/22

  Verifying  : deltarpm-3.6-3.el7.x86_64                                  11/22

  Verifying  : python2-dnf-4.0.9.2-2.el7_9.noarch                         12/22

  Verifying  : leapp-deps-0.18.0-2.el7.noarch                             13/22

  Verifying  : python2-libdnf-0.22.5-2.el7_9.x86_64                       14/22

  Verifying  : libyaml-0.1.4-11.el7_0.x86_64                              15/22

  Verifying  : python2-libcomps-0.1.8-14.el7.x86_64                       16/22

  Verifying  : 1:leapp-upgrade-el7toel8-0.21.0-4.el7.elevate.4.noarch     17/22

  Verifying  : 1:leapp-upgrade-el7toel8-deps-0.21.0-4.el7.elevate.4.noa   18/22

  Verifying  : libsolv-0.6.34-4.el7.x86_64                                19/22

  Verifying  : python2-leapp-0.18.0-2.el7.noarch                          20/22

  Verifying  : PyYAML-3.10-11.el7.x86_64                                  21/22

  Verifying  : pciutils-3.5.1-3.el7.x86_64                                22/22

Installed:

  leapp-data-rocky.noarch 0:0.5-1.el7.20241127

  leapp-upgrade-el7toel8.noarch 1:0.21.0-4.el7.elevate.4

Dependency Installed:

  PyYAML.x86_64 0:3.10-11.el7

  deltarpm.x86_64 0:3.6-3.el7

  dnf.noarch 0:4.0.9.2-2.el7_9

  dnf-data.noarch 0:4.0.9.2-2.el7_9

  leapp.noarch 0:0.18.0-2.el7

  leapp-deps.noarch 0:0.18.0-2.el7

  leapp-upgrade-el7toel8-deps.noarch 1:0.21.0-4.el7.elevate.4

  libcomps.x86_64 0:0.1.8-14.el7

  libdnf.x86_64 0:0.22.5-2.el7_9

  libmodulemd.x86_64 0:1.6.3-1.el7

  librepo.x86_64 0:1.8.1-8.el7_9

  libreport-filesystem.x86_64 0:2.1.11-53.el7.centos

  libsolv.x86_64 0:0.6.34-4.el7

  libyaml.x86_64 0:0.1.4-11.el7_0

  pciutils.x86_64 0:3.5.1-3.el7

  python2-dnf.noarch 0:4.0.9.2-2.el7_9

  python2-hawkey.x86_64 0:0.22.5-2.el7_9

  python2-leapp.noarch 0:0.18.0-2.el7

  python2-libcomps.x86_64 0:0.1.8-14.el7

  python2-libdnf.x86_64 0:0.22.5-2.el7_9

Complete!

3. Run the Pre-Upgrade Checks

This stage is typically where sticky situations will show up. Even on a simple system like the one we’re using for this example, the pre-upgrade is very likely to have critical errors. It is worth noting that this step does not change the system, so no side effects should be expected from running this command; notice that it does not require root permissions.

$ leapp preupgrade

============================================================

                      REPORT OVERVIEW

============================================================

Upgrade has been inhibited due to the following problems:

    1. Leapp detected loaded kernel drivers which have been removed in RHEL 8. Upgrade cannot proceed.

    2. Missing required answers in the answer file

HIGH and MEDIUM severity reports:

    1. GRUB2 core will be automatically updated during the upgrade

    2. Difference in Python versions and support in RHEL 8

    3. Packages not signed by Red Hat found on the system

    4. Detected custom leapp actors or files.

    5. Detected customized configuration for dynamic linker.

    6. ipa-server package is installed but no IdM is configured

    7. chrony using default configuration

Reports summary:

    Errors:                      0

    Inhibitors:                  2

    HIGH severity reports:       5

    MEDIUM severity reports:     2

    LOW severity reports:        4

    INFO severity reports:       2

Before continuing, review the full report below for details about discovered problems and possible remediation instructions:

    A report has been generated at /var/log/leapp/leapp-report.txt

    A report has been generated at /var/log/leapp/leapp-report.json

============================================================

                   END OF REPORT OVERVIEW

============================================================

The leapp-report.txt that is generated from the pre-upgrade command contains some canned resolutions to the errors it generated. Let’s try some of those answers!

# Remove pkcs11 module

$ leapp answer --section remove_pam_pkcs11_module_check.confirm=True

$ rmmod pata_acpi floppy

There are plenty of other messages that were generated and put in the report, but only the ones we used above were necessary for the upgrade to move forward. If you have third-party repositories, there’s a possibility that upgraded versions of system dependencies might be present. You’ll have to manually remove or downgrade those packages to resolve these version inconsistencies.

In order for the upgrade to proceed, you can’t have any errors that will inhibit the upgrade. This will appear in the “Reports Summary” under “Inhibitors.” You may address some of the warnings that appear by examining the report and testing the suggestions they provide to see if you can resolve the warnings. Keep in mind only the inhibitors are required to proceed, though.

4. Run the Upgrade Process, Reboot, and Test

$ leapp upgrade

Once this command completes, it’s time to reboot into the new kernel. You will need to manually intervene and select the new boot option in GRUB: ELevate-Upgrade-Initramfs

Now that you’ve booted into the new environment, you should see that you’re running on Rocky 8. Run through your standard QA tests to make sure the services the host is providing all work as expected. Continuing the upgrade to Rocky 9 isn’t likely to fix services that are broken at this stage, so conduct a thorough check before continuing the upgrade.
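A few quick checks right after the reboot confirm you really are on the intermediate release before investing time in deeper QA (the output will vary by environment):

$ cat /etc/os-release       # should now identify Rocky Linux 8
$ uname -r                  # should show an el8 kernel
$ dnf repolist              # Rocky Linux 8 repositories should be enabled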

5. Upgrade to Rocky 9

The steps for continuing the upgrade to Rocky 9 are similar to the steps we took to upgrade to Rocky 8, starting with the ELevate repo:

$ yum install -y http://repo.almalinux.org/elevate/elevate-release-latest-el$(rpm --eval %rhel).noarch.rpm

Next, we can remove package exclusions that were added from the previous LEAPP upgrade:

$ yum config-manager --save --setopt exclude=""

You might run into a scenario like we did, where we had to remove LEAPP along with its dependencies because leapp-upgrade-el7toel8 was still installed and failed to match the required version. Then you can install the following:

$ yum install -y leapp-upgrade leapp-data-rocky

And then run the LEAPP preupgrade again: $ leapp preupgrade

The output of the report will be similar to the previous one. After examining the report, we had to run these steps:

$ sed -i "s/^AllowZoneDrifting=.*/AllowZoneDrifting=no/" /etc/firewalld/firewalld.conf

$ leapp answer --section check_vdo.confirm=True

Note: If root login is allowed, the report will fail. We resolved this by overriding our sshd_config:

$ sed -i 's/^PermitRootLogin yes$/PermitRootLogin yes # test/' /etc/ssh/sshd_config

Here was our last report before we ran the upgrade again:

$ leapp preupgrade

============================================================

                      REPORT OVERVIEW

============================================================

HIGH and MEDIUM severity reports:

    1. Packages not signed by Red Hat found on the system

    2. Detected custom leapp actors or files.

    3. Leapp detected loaded kernel drivers which are no longer maintained in RHEL 9.

    4. Remote root logins globally allowed using password

    5. GRUB2 core will be automatically updated during the upgrade

Reports summary:

    Errors:                      0

    Inhibitors:                  0

    HIGH severity reports:       5

    MEDIUM severity reports:     0

    LOW severity reports:        2

    INFO severity reports:       3

Before continuing, review the full report below for details about discovered problems and possible remediation instructions:

    A report has been generated at /var/log/leapp/leapp-report.txt

    A report has been generated at /var/log/leapp/leapp-report.json

============================================================

                   END OF REPORT OVERVIEW

============================================================

Let’s give it a shot!

$ leapp upgrade

Again, even on a simple system, we can get blocking errors:

Following errors occurred and the upgrade cannot continue:

    1. Actor: dnf_package_download

       Message: DNF execution failed with non zero exit code.

Looking at /var/log/leapp/leapp-report.txt, there are a number of warnings, including a conflict between rocky-logos 86.3-1.el8 and rocky-logos-90.15-2.el9:

file /usr/share/redhat-logos from install of rocky-logos-90.15-2.el9.x86_64 conflicts with file from package rocky-logos-86.3-1.el8.x86_64

To resolve this, we removed rocky-logos (which also removed rocky-backgrounds) and re-ran the LEAPP upgrade. Our next step was to reboot, just as we did during the CentOS 7 to Rocky 8 upgrade, select the GRUB entry ELevate-Upgrade-Initramfs again, and watch it go!

Upon rebooting, it was time to remove the excludes again:

$ yum config-manager --save --setopt exclude=""

Then we could remove orphaned packages, which cleans up the system and makes it more secure:

$ rpm -qa | grep -E 'el8[.-]' | xargs rpm -e

The same goes for the LEAPP packages, since they’re not needed anymore:

$ dnf remove $(rpm -qa|grep leapp)

Done! Now it’s time to run our E2E and integration tests. After thorough testing of this host, it will be ready to re-enter production as an upgraded system.

Final Thoughts

Organizations with a long list of hosts in their host inventory, or hosts with especially complex software inventories, may need assistance sorting out all of the complexities associated with a CentOS to Rocky Linux migration. Or they might need more time than they initially allotted for the project. Partnering with OpenLogic for CentOS long-term support or migration services can ease the burden considerably. Our Professional Services team can help you plan your migration or provide hands-on-keyboard support throughout the process. And after you have successfully migrated, our Enterprise Architects can offer Rocky Linux support, up to 24/7.

About Perforce
The best run DevOps teams in the world choose Perforce. Perforce products are purpose-built to develop, build and maintain high-stakes applications. Companies can finally manage complexity, achieve speed without compromise, improve security and compliance, and run their DevOps toolchains with full integrity. With a global footprint spanning more than 80 countries and including over 75% of the Fortune 100, Perforce is trusted by the world’s leading brands to deliver solutions to even the toughest challenges. Accelerate technology delivery, with no shortcuts.

About Version 2 Digital

Version 2 Digital is one of the most dynamic IT companies in Asia. The company distributes a wide range of IT products across various areas including cyber security, cloud, data protection, end points, infrastructures, system monitoring, storage, networking, business productivity and communication products.

Through an extensive network of channels, point of sales, resellers, and partnership companies, Version 2 offers quality products and services which are highly acclaimed in the market. Its customers cover a wide spectrum which include Global 1000 enterprises, regional listed companies, different vertical industries, public utilities, Government, a vast number of successful SMEs, and consumers in various Asian cities.

How to Find the Best Linux Distro for Your Organization

“What’s the best Linux distro?”

A better question to ask: “Which Linux distro can meet my business’s needs now and as we scale?”

Now that CentOS Linux has reached end of life, the playing field has widened, with several viable alternatives. This blog gives an overview of the post-CentOS EOL Linux landscape, comparing the most popular Enterprise Linux distributions and highlighting key differentiators. As you read, weigh your own team's bandwidth and expertise with managing Linux infrastructure against factors like cost, stability, and security.

As those in the process of migrating off CentOS know all too well, longevity is important, too. Confidence in the project’s direction, the strength of the community, governance model (i.e. how much control a for-profit corporation has) — these are all considerations that could (and should) influence which open source Linux distro is the right fit for your organization.

 

 

Types of Open Source Linux Distributions

Linux distros are a combination of the open source Linux kernel and a suite of supporting software that facilitates the development and operation of applications. Open source communities make decisions about which packages to include based on the use cases they want to prioritize. A Linux distribution designed for desktop, for example, might include tools like media players and UI customizability features. Enterprise Linux distros, on the other hand, focus more on security, stability, and speed to optimize performance for mission-critical applications.

There are a few different ways to categorize open source Linux distros. You can bucket them according to who manages the project (a community or a commercial entity), the release model (rolling or fixed), or the upstream source (e.g. Fedora, Debian).

 

Community vs. Commercial

The difference between community and commercial open source Linux distributions is that community-backed distributions are free to use and supported by a community of individual contributors. These volunteers dedicate time and expertise to maintain the project and commit to releasing security updates, bug fixes, and new versions.

Commercial Enterprise Linux distributions are sold by software vendors who build their product from open source components and packages and require a paid subscription. The distro itself is functionally identical to the community version, but users have access to technical support, and often some proprietary enterprise features/tooling.

 

Rolling Release vs. Fixed Release

Rolling release means that updates and new features are continuously and incrementally released instead of bundled into versions that are released on a fixed schedule. Frequent updates to the Linux kernel, libraries, utilities, or any package are released as soon as they are ready without waiting for a defined release date. Typically, rolling release Linux distros do not require users to perform large-scale version upgrades because of this “steady drip” of updates. Issues, bugs, and vulnerabilities can be identified and resolved more rapidly compared to fixed, or regular, release distros.

Rolling release distros appeal to those who prioritize having the latest software and features over stability. However, they require users to stay proactive in system maintenance and be prepared to address issues that arise due to the constant stream of updates. Rolling release models can often lead to conflicts between different software versions as no testing is done to validate that different software interoperates correctly; sometimes new features in a new package release can also lead to subtle behavior differences that cause application breakage. As such, many organizations prefer fixed release models for their business critical applications.

 

Upstream Source

There are distros derived from Fedora, RHEL (which itself comes from Fedora), Debian, SUSE, and more. Each ecosystem has strengths, and preference here might come down to what your team is accustomed to and other considerations (for instance, if you are already an Oracle customer, Oracle Linux might make more sense than if you are not).

Now let’s take a closer look at some of the distributions themselves, grouping them by their upstream source and starting with Fedora.

Note: Asterisks denote that the distribution is currently supported by OpenLogic. 

Fedora and RHEL-Based Linux Distros

Fedora*

Fedora is a popular, community-backed Linux distro known for its emphasis on new features and technologies, and open source collaboration. It aims to provide a platform for both desktop and server users, offering the latest software while maintaining a balance between innovation and stability. Fedora users appreciate staying on the forefront of technology, contributing to open source projects, and experimenting with the latest software innovations. Fedora typically releases two new versions a year, one in the spring and one in the fall.

 

CentOS Stream*

CentOS Stream is referred to as the “rolling preview” on which RHEL releases are based. It is the bridge between Fedora and RHEL, using the same source code Red Hat uses to produce the next version of RHEL. The current version is CentOS Stream 10 and precedes RHEL 10 (and downstream RHEL distros like Rocky Linux and AlmaLinux).

Picking CentOS Stream comes down to your preferences for your overall Linux ecosystem. Everything that you expect inside a RHEL/CentOS ecosystem, such as package manager and virtualization options, will still be available to you in Stream, and you’ll receive bug fixes and security patches on a faster schedule than on CentOS Linux. If you’re on the fence about the rolling release route and not sure your organization is ready, this CentOS Stream migration checklist is a good resource.

 

Red Hat Enterprise Linux (RHEL)

RHEL is a well-established commercial Enterprise Linux distro known for its stability, long-term support, and comprehensive ecosystem. It offers various editions tailored for different workloads and environments, such as servers, cloud, and container deployments. RHEL is built off of snapshots of CentOS Stream, freezing all software versions to those in the snapshot, and only applying security fixes going forward from that release version. This is what gives it stability and security.

Red Hat, now owned by IBM, provides support for RHEL customers, but the license cost and annual fees may be prohibitively expensive for some organizations. As with any commercial software, there is a greater risk of vendor lock-in as well.

 

CentOS Linux (Discontinued)*

Much to the community’s surprise (and dismay), CentOS 8 was prematurely sunsetted in 2021, just two years after its release, and CentOS 7 reached end of life in 2024. Red Hat, who then controlled the project, announced the end of CentOS Linux as part of their decision to focus more on CentOS Stream. This led to the creation of new distros derived from the RHEL source code, most notably Rocky Linux and AlmaLinux, to replace CentOS Linux.

Migrating and decommissioning environments can take months (or even years), so CentOS long-term support is one option for businesses that need more time to evaluate other distros and transition their EOL CentOS deployments.

 

Rocky Linux*

Rocky Linux is a community-supported Linux distro created by one of the founders of CentOS and one of the most popular CentOS alternatives. Promising bug-for-bug compatibility with RHEL, Rocky Linux aims to provide a stable, reliable, and compatible platform for organizations and users who were previously relying on CentOS for their server infrastructure.

Related Blog >> Comparing Rocky Linux vs. RHEL

 

AlmaLinux*

Like Rocky Linux, AlmaLinux is a community-backed, open source Linux distro launched in response to the CentOS Linux project being discontinued. AlmaLinux is binary-compatible with RHEL, meaning that applications will run on AlmaLinux as seamlessly as in RHEL.

 

Oracle Linux*

Oracle Linux is packaged and distributed by Oracle, and is another binary-compatible rebuild of RHEL’s RPMs. Oracle Linux is tested and optimized to work well with Oracle’s other software offerings, making it a suitable choice for running Oracle databases and other application workloads. Some worry that eventually Oracle might start charging for Oracle Linux (as they did with Oracle JDK in 2019), but as of now it is free to use, and SLA-backed commercial support can be purchased at a price point similar to RHEL.

Get the Decision Maker’s Guide to Enterprise Linux

In this complete guide to the Enterprise Linux landscape, our experts present insights and analysis on 20 of the top Enterprise Linux distributions — with a full comparison matrix and battlecards.

Download for Free

Debian-Based Linux Distributions

 

Debian Linux*

Debian is known for its commitment to open source principles, stability, and extensive package management system. It serves as the foundation for various other Linux distros such as Ubuntu and Linux Mint. Debian is widely used in both desktop and server environments. It is a popular choice for users seeking a reliable and customizable Linux distro for a wide range of applications and use cases, including embedded systems.

 

Debian Testing

Debian also has a testing branch, similar to a beta version, which is an intermediary stage between Debian’s unstable and stable branches. The testing branch is intended for users who want a balance between access to newer software and a relatively stable system. Debian Testing gets new features and fixes before the stable release, so there may be issues to troubleshoot in exchange for early access to the latest features, many of which eventually make their way into the stable Debian release.

 

Ubuntu Community Edition*

Often referred to as simply “Ubuntu,” this distro is widely used due to its user-friendly experience, robust software ecosystem, and active community support. It is a solid choice for desktop, server, and enterprise use. Like Debian, Ubuntu uses the apt ecosystem for package management, and many AI-related packages are included in the distro.

 

Ubuntu Pro

Ubuntu Pro is the commercialized version of Ubuntu known for its ease of use, regular updates, and compatibility with cloud environments. There are versions optimized for different environments, such as Ubuntu Desktop, Ubuntu Server, Ubuntu for IoT, and Ubuntu Cloud. Ubuntu attracts front-end developers with easy-to-use features and a slew of programming resources, including AI libraries.

 

Linux Mint

Linux Mint strives to provide a stable, user-friendly experience for both Linux newcomers and experienced users. It is based on Ubuntu and Debian, building upon their foundations while adding additional features and design elements. Linux Mint emphasizes convenience and provides a traditional desktop experience with a lot of customization options. It also was designed to help Windows users seamlessly transition to a Linux OS.

SUSE Distributions

OpenSUSE Leap*

OpenSUSE Leap is a community-driven distro that combines the stability of a fixed release model with the availability of up-to-date software packages. It provides a reliable and user-friendly operating system for both desktop and server environments. OpenSUSE is generally considered to be stable for production use, and those familiar with the SLES, SUSE, and Slackware ecosystem will feel comfortable in this environment. OpenSUSE focuses on deployment simplicity, user-friendly toolchain, and cloud-readiness.

 

OpenSUSE Tumbleweed*

Tumbleweed is the OpenSUSE community’s rolling release distro. Just as in CentOS Stream, bug fixes and security patches come earlier than in OpenSUSE Leap, the regular release distro, but there also could be some features that are not quite ready for primetime. Tumbleweed supports a wide range of desktop environments, software libraries, and tools.

 

SUSE Linux Enterprise Server (SLES)

SLES is the commercial counterpart to the OpenSUSE Linux distros and is backed by SUSE, a German-based multinational enterprise. It is an enterprise-focused distribution with a strong emphasis on reliability, scalability, and high-performance computing. It offers features like systemd, Btrfs, and container support, making it suitable for various server and virtual environments.

Other Open Source Linux Distributions

 

Arch Linux

Arch Linux is a rolling, lightweight Linux distro that is highly customizable and emphasizes simplicity, minimalism, and a DIY approach. It is a better fit for experienced Linux users who want to build a tailored and efficient OS environment according to their specific needs. Its rolling release model provides continuous updates to the latest software packages and features without the need for major version upgrades. Arch Linux is popular among developers and Linux enthusiasts (aka “power users”) who enjoy experimenting with and fine-tuning their Linux system.

 

Alpine Linux

Alpine Linux is a security-oriented, lightweight Linux distro designed for resource efficiency and containerization. It is known for its small footprint, speed, and focus on security measures. Alpine Linux is often used where size and security are critical, such as in containers, IoT devices, and embedded systems, and it is particularly suitable where fast boot times, small memory usage, and strong security are required.

 

Amazon Linux

Amazon Linux is AWS’s Linux distro intended for use in Amazon Elastic Compute Cloud (EC2) environments. It is offered as pre-configured Amazon Machine Images (AMIs) ready to use in AWS. Originally built from RHEL, the distro is now derived from CentOS Stream, and the source code is publicly available and distributed under open source licenses.

Final Thoughts

Hopefully it is clear by now that choosing the best Linux distro for your organization will likely take some time and research. Considering what each offering can help your business achieve, and where you might find friction in implementation, is key to succeeding with your next open source Linux distro. Make sure you think about intended use cases, the skills required, and learning curve. Tooling (such as package management) is important to evaluate, along with ecosystem, compatibility, and vendor lock-in risk.

One way to avoid vendor lock-in but still get the security and support you need is to partner with a third party like OpenLogic. Our Enterprise Linux support is guaranteed by SLAs and every ticket is handled by an Enterprise Architect with at least 15 years of Linux experience. We also offer migration services – from consulting to executing the migration itself.

Editor’s Note: This blog was originally published in January 2021. It was updated in February 2025 to reflect changes in the open source Enterprise Linux landscape.

Looking For Migration Services or Support?

OpenLogic offers CentOS migration services and technical support, backed by SLAs, for AlmaLinux, Rocky Linux, CentOS Stream, Ubuntu, Debian, Oracle Linux, and more. Talk to an expert today to get started.

Talk to an Expert  See Datasheet

 


Open Source Big Data Infrastructure: Key Technologies for Data Storage, Mining, and Visualization

Big Data infrastructure refers to the systems (hardware, software, network components) and processes that enable the collection, management, and analysis of massive datasets. Companies that handle large volumes of data constantly coming in from multiple sources often rely on open source Big Data frameworks (i.e. Hadoop, Spark), databases (i.e. Cassandra), and stream processing platforms (i.e. Kafka) as the foundation of their Big Data infrastructure. In this blog, we’ll explore some of the most commonly used technologies and methods for data storage, processing, mining, and visualization in an open source Big Data stack. 

Data Storage and Processing

The primary purpose of Big Data storage is to successfully store vast amounts of data for future analysis and use. A scalable architecture that allows businesses to collect, manage, and analyze immense sets of data in real time is essential. Big Data storage solutions are designed to address the speed, volume, and complexity of large datasets. Examples include data lakes, warehouses, and pipelines, all of which can exist in the cloud, on-premises, or in an off-site physical location (which is referred to as colocation storage).

Data Lakes

Data lakes are centralized storage solutions that process and secure data in its native format without size limitations. They can enable different forms of smart analytics, such as machine learning and visualizations.

Data Warehouses

Data warehouses aggregate datasets from different sources into a single storage unit for robust analysis, data mining, AI, and more. Unlike a data lake, data warehouses have a three-tier structure for storing data.

Data Pipelines

Data pipelines gather raw data from one or more sources, potentially merge and transform it in some way, and then transport it to another location, such as lakes or warehouses.

Related Technologies

No matter where data is stored, at the heart of any Big Data stack is the processing framework. One prominent open source example is Apache Hadoop, which allows for the distributed processing of large datasets across clusters of computers. Hadoop has been around for a long time, but is still popular especially for non-cloud-based solutions. It can be seamlessly coupled with other open source data technologies like Hive or HBase for a more comprehensive implementation to meet business requirements. 

Data Mining

Data mining is defined as the process of filtering, sorting, and classifying data from large datasets to reveal patterns and relationships, which helps enterprises identify and solve complex business problems through data analysis. 
Machine learning (ML), artificial intelligence (AI), and statistical analysis are the crucial data mining elements necessary to scrutinize, sort, and prepare data for deeper analysis. Top ML algorithms and AI tools have enabled the easy mining of massive datasets, including customer data, transactional records, and even log files picked up from sensors, actuators, IoT devices, mobile apps, and servers.

Every data science application demands a different data mining approach. Pattern recognition and anomaly detection are two of the most well-known, and both employ a combination of techniques to mine data. Let’s look at some of the fundamental data mining techniques commonly used across industry verticals.

Association Rule

The association rule refers to the if-then statements that establish correlations and relationships between two or more data items. The correlations are evaluated using support and confidence metrics, where support determines the frequency of occurrence of data items within the dataset, and confidence relates to the accuracy of if-then statements. For example, while tracking a customer’s behavior when purchasing online items, an observation is made that the customer generally buys cookies when purchasing a coffee pack. In such a case, the association rule establishes the relation between two items (cookies and coffee packs), and forecasts future buys whenever the customer adds the coffee pack to the shopping cart.  
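To make the support and confidence metrics concrete, here is a minimal Python sketch; the transactions and item names are invented for illustration and are not drawn from any real dataset:

# Hypothetical transaction data for the rule "coffee pack -> cookies".
transactions = [
    {"coffee pack", "cookies", "milk"},
    {"coffee pack", "cookies"},
    {"coffee pack", "bread"},
    {"tea", "cookies"},
    {"coffee pack", "cookies", "sugar"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent): support of both divided by support of the antecedent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"coffee pack", "cookies"}, transactions))       # 0.6
print(confidence({"coffee pack"}, {"cookies"}, transactions))   # 0.75

With this toy data, the rule has a support of 0.6 (three of five baskets contain both items) and a confidence of 0.75 (three of the four coffee-pack baskets also contain cookies).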

Classification

The classification data mining technique classifies data items within a dataset into different categories. For example, vehicles can be grouped into different categories, such as sedan, hatchback, petrol, diesel, electric, etc., based on attributes such as the vehicle’s shape, wheel type, or even number of seats. When a new vehicle arrives, it can be categorized into various classes depending on the identified vehicle attributes. The same classification strategy can be applied to categorize customers based on factors like age, address, purchase history, and social group.
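As an illustrative sketch, the following uses scikit-learn (assumed to be installed) to train a small classifier on a handful of invented vehicle attributes and then categorize a new vehicle:

# A minimal classification sketch; the vehicle features and labels are made up.
from sklearn.tree import DecisionTreeClassifier

# Features: [number of seats, number of doors, approximate length in meters]
X = [
    [5, 4, 4.8],   # sedan
    [5, 5, 4.1],   # hatchback
    [5, 4, 4.9],   # sedan
    [4, 3, 3.9],   # hatchback
]
y = ["sedan", "hatchback", "sedan", "hatchback"]

clf = DecisionTreeClassifier().fit(X, y)

# A new vehicle arrives and is assigned to a category based on its attributes.
print(clf.predict([[5, 5, 4.0]]))  # likely ['hatchback']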

Clustering

Clustering data mining techniques group data elements into clusters that share common characteristics. Data pieces get clustered into categories by simply identifying one or more attributes. Some of the well-known clustering techniques are k-means clustering, hierarchical clustering, and Gaussian mixture models.  
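Here is a minimal k-means sketch with scikit-learn (assumed to be installed); the two-dimensional points are synthetic and could represent, say, annual spend and order counts per customer:

# Synthetic data: each row could be [annual spend, number of orders] for a customer.
from sklearn.cluster import KMeans

X = [
    [100, 2], [120, 3], [110, 2],      # low-spend group
    [900, 30], [950, 28], [880, 32],   # high-spend group
]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] -- cluster ids; the numbering may differ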

Regression

Regression is a statistical modeling technique using previous observations to predict new data values. In other words, it is a method of determining relationships between data elements based on the predicted data values for a set of defined variables. This category’s classifier is called the “Continuous Value Classifier.”  
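A minimal example with NumPy (assumed to be installed): fit a line to previous observations and use it to predict the next value. The numbers are illustrative only.

import numpy as np

x = np.array([1, 2, 3, 4, 5])           # e.g. months
y = np.array([10, 12, 15, 19, 22])       # e.g. sales in those months

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit of y = slope*x + intercept
print(slope * 6 + intercept)             # predicted value for month 6 (about 24.9)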

Sequence & Path Analysis

One can also mine sequential data to determine patterns, wherein specific events or data values lead to other events in the future. This technique is applied for long-term data as sequential analysis is key to identifying trends or regular occurrences of certain events. For example, when a customer buys a grocery item, you can use a sequential pattern to suggest or add another item to the basket based on the customer’s purchase pattern.  
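A toy version of this idea, using only the Python standard library and invented purchase histories, counts which item most often follows another and uses that to suggest the next item:

from collections import Counter

purchase_sequences = [
    ["coffee pack", "cookies", "milk"],
    ["coffee pack", "cookies"],
    ["bread", "coffee pack", "cookies"],
    ["tea", "biscuits"],
]

# Count how often each item immediately follows another across all sequences.
follow_counts = Counter()
for seq in purchase_sequences:
    for current_item, next_item in zip(seq, seq[1:]):
        follow_counts[(current_item, next_item)] += 1

# Recommend the most common follow-up for "coffee pack".
candidates = {nxt: n for (cur, nxt), n in follow_counts.items() if cur == "coffee pack"}
print(max(candidates, key=candidates.get))  # 'cookies'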

Neural Networks

Neural networks technically refer to algorithms that mimic the human brain and try to replicate its activity to accomplish a desired goal or task. These are used for several pattern recognition applications that typically involve deep learning techniques. Neural networks are a product of advanced machine learning research.  

Prediction

The prediction data mining technique is typically used for predicting the occurrence of an event, such as machinery failure or a fault in an industrial component, a fraudulent event, or company profits crossing a certain threshold. Prediction techniques can help analyze trends, establish correlations, and do pattern matching when combined with other mining methods. Using such a mining technique, data miners can analyze past instances to forecast future events.  

Related Technologies

When it comes to data mining tasks, open source technologies like Spark, YARN, or Oozie are great engines that use flexible and powerful MapReduce and batch processing techniques.

Data Visualization

Data visualization is the graphical representation of information and data. With visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
As more companies depend on Big Data to make operational and business-critical decisions, visualization has become a key tool for making sense of the trillions of rows of data generated every day.

Data visualization helps tell stories by curating data into a medium that is easier to understand. A good visualization removes the noise from data and highlights the useful information, like trends and outliers. However, it’s not as simple as just dressing up a graph to make it look better or slapping on the “info” part of an infographic. Effective data visualization is a delicate balancing act between form and function. The plainest graph could be too boring to catch any notice, or it could make a powerful point; likewise, the most stunning visualization could utterly fail at conveying the right message, or it could speak volumes. The data and the visuals need to work together, and there’s an art to combining great analysis with great storytelling.

Related Technologies

The open source software that best responds to these needs is Grafana, which encompasses all basic visualization elements. With a tool like Grafana, a business will be able to effectively monitor their Big Data implementation, and let data visualizations drive informed decisions, enhance system performance, and streamline troubleshooting.    

Final Thoughts

While we’ve covered some of the fundamentals of Big Data infrastructure here, it should go without saying that there is much more to this topic than can be covered in a single blog post. It’s also worth noting that implementing and maintaining Big Data infrastructure requires a high level of technical expertise. These technologies are among the most complex, which is why companies that lack the in-house capabilities often turn to third parties for commercial support and/or Big Data platform administration. Investing in a Big Data platform can deliver big rewards, but only if it’s backed by a solid Big Data strategy and managed by individuals who have the necessary skills and experience.

Unlock the Power of Your Big Data

If you need to modernize your Big Data infrastructure or have questions about administering or supporting technologies like Hadoop, our Enterprise Architects can help. Talk to a Big Data expert


5 Reasons Why Companies Choose OpenLogic to Support Their Open Source

As shown in the State of Open Source Report, organizations around the world today are consuming and contributing to open source software (OSS) more than ever before. But successfully deploying open source in mission-critical applications requires a dependable partner for expert technical support and professional services. 

In this blog, see the top 5 reasons why companies choose OpenLogic by Perforce and how we help them harness the innovative potential of open source while mitigating risk. 

 

Why Companies Need OSS Support

According to the most recent State of Open Source Report, the #1 reason organizations regardless of size, geographic region, or industry are using OSS is because there is no license cost and it saves them money.  

However, while community open source software is free to use, you still have to know how to use it. Year after year, the State of Open Source Report shows that finding personnel with the skills and experience needed to integrate, operate, and maintain open source  technologies is a constant challenge. Self-support quickly becomes cumbersome and unsustainable, and community forums and documentation can only take you so far.  

This is why many organizations taking advantage of the cost-effectiveness of OSS also invest in third-party support from a commercial vendor like OpenLogic.

The Top 5 Reasons Companies Choose OpenLogic for OSS Support

For more than 20 years, OpenLogic has offered expert OSS technical support and professional services (i.e. consulting, migrations, training) to organizations around the world. Below are insights from customers sharing what made them pick OpenLogic as their OSS partner. 

1. One Vendor Who Can Support All the OSS in Your Stack

OpenLogic supports  400+ open source technologies including top Enterprise Linux distributions, databases and Big Data technologies, frameworks, middleware, DevOps tooling, and more. For our customers, we are a one-stop shop for most (if not all) of the OSS used in their development and production environments.    

One of the drawbacks of the commercialization of OSS is that organizations can end up working with multiple support vendors, sometimes a dozen or more — which leads to finger-pointing and delayed resolution when something goes awry. Another concern is vendor lock-in when organizations are subject to price increases or required to work only with the services and integrations in their vendors’ ecosystems.  

OpenLogic solves both of these problems. Organizations can consolidate their support by partnering with one vendor capable of supporting all the OSS in their stack while maintaining the freedom to switch technologies whenever they want.  

2. Consistent, Direct Support From Experienced Enterprise Architects

Lack of internal skills and staff churn can prevent organizations from being able to unlock the full power of OSS. For large organizations, the personnel may be available, but they do not always have the proficiency required to manage the latest technologies. OpenLogic bridges these gaps by giving customers a direct pipeline to a best-in-class team of experts with full-stack expertise.  

Unlike many tech support call centers, OpenLogic customers interact directly with Enterprise Architects with at least 15 years of experience on every support ticket. Our experts have worked hands-on with complex deployments, so whether customers need assistance with upgrades between releases, adjusting configurations for critical scalability, or troubleshooting performance issues, they benefit immediately from the breadth and depth of our team’s technical knowledge.  

Explore OpenLogic Pricing and Plans

For two decades, OpenLogic has partnered with Fortune 100 companies to drive growth and innovation with open source software. Click the button below to receive a custom quote for technical support, LTS, or professional services.

Request Pricing

 

3. Meet Compliance Requirements With SLA-Backed Support

Compliance refers to both internal controls and external requirements that protect an organization’s IT infrastructure. PCI-DSS, CIS Controls, ISO 27001, GDPR, FedRAMP, HIPAA, and other regulations require fully supported software and updates to the latest releases and security patches, and there are no exceptions for open source software. 

Keeping up with updates and patches is an ongoing struggle for organizations using OSS. OpenLogic’s deep expertise with OSS release lifecycles — and history of providing long-term support for end-of-life software like CentOS, AngularJS, and Bootstrap — is one of the biggest reasons why organizations choose to work with us. Partnering with OpenLogic makes it easier to stay compliant and pass IT audits because they have technical support and LTS guaranteed by enterprise-grade SLAs for response and resolution times.   

 

4. Expertise Integrating Open Source Packages Into Full Stack Deployments

Integration and interoperability among all the OSS in most tech stacks is seldom straightforward. Even with mature and stable open source infrastructure software, the interrelation between components is often complex enough to necessitate assistance from OpenLogic’s experts. 

Most support tickets are not opened because of a bug in the software. It’s more common for issues that touch two or more technologies to arise — and that’s when having a single vendor with full stack operational expertise is advantageous. We can troubleshoot and get you back to full functionality faster because we can holistically assess what’s happening across your entire stack.  

 

5. Unbiased Guidance Regardless of Infrastructure or Environment

Because OpenLogic is software-agnostic, customers can count on our Enterprise Architects to provide unbiased recommendations based on their specific needs rather than on sponsorships or commercial interests. We will always suggest the technologies that make sense for your business, not ours.     

We also understand that today’s organizations host their applications in diverse environments, including on-premises, public clouds, and in hybrid environments, as well as using bare metal, virtual machines, or containers. OpenLogic supports customers regardless of their infrastructure or environment; there are no platform restrictions or limitations in the amount of support provided, and we’ll never pressure you to migrate to a public cloud in order to receive our services.  

Final Thoughts

Supporting all your open source packages internally can put a drain on resources and take developers’ focus away from where it should be: innovating for your business. Partnering with OpenLogic allows you to take advantage of free community open source but with the added security of guaranteed SLAs and 24/7 support delivered by experts with deep OSS expertise.  


Navigating Software Dependencies and Open Source Inventory Management

Keeping track of software dependencies is not an easy task and only becomes more difficult as companies scale. In this blog, we explore the types of dependencies and complications they can cause, as well as available tools and best practices organizations can adopt to improve their open source inventory management.

 

Understanding Software Dependencies

Software dependency management is a hot topic and an ongoing area for learning and process improvements. Dependencies are the byproduct of code collaboration and sharing, and all of us who consume and/or contribute to OSS are potential victims of the consequences if they aren’t properly managed. And while not unique to open source software, the rapid proliferation of open source technologies has made tracking software dependencies more complex.

There are two main categories of software dependencies:

  • Direct dependencies: This refers to frameworks, libraries, modules, and other software components that an application deliberately and “directly” references to address a solved problem.
  • Transitive dependencies: This refers to the cascading list of those independent pieces of software that the direct dependencies in turn include to function properly.

Beyond that, there are some distinctions within those two main categories that are good to be aware of before defining a dependency management strategy:

  • Internal vs. External: Some dependencies may be owned and controlled internally by a development team, though typically the vast majority are created and maintained externally.
  • Open vs. Closed: Referenced dependencies may be open source allowing development team investigation and ownership by proxy, or they might be binary-only licensed from a vendor where changes are managed through contractual terms.
  • Idle vs. Engaged: As the application source evolves, needs change, rendering some dependencies irrelevant. However, they are not always removed from the dependency chain. As a result, some dependencies are actively engaged and used, whereas others are no longer used and remain bundled but idle.

A software inspection methodology that includes inventorying dependencies and managing lifecycles is essential to system security and sustainability. An up-to-date software inventory is necessary for identifying vulnerable or end-of-life components, and identification is the first step in remediating issues and mitigating risks.
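As a small illustration of the direct vs. transitive distinction, the sketch below walks the requirements of an installed Python package using only the standard library. The package name "requests" is just an example, and a production inventory would rely on the SBOM and scanning tools described later rather than an ad-hoc script.

import re
from importlib import metadata

def dependency_tree(package, seen=None, depth=0):
    # Recursively print a package's direct dependencies, then their dependencies.
    seen = set() if seen is None else seen
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        print("  " * depth + f"{package} (not installed)")
        return
    print("  " * depth + package)
    for req in requirements:
        # Skip optional extras for brevity (e.g. '; extra == "socks"').
        if ";" in req and "extra" in req.split(";", 1)[1]:
            continue
        name = re.match(r"[A-Za-z0-9._-]+", req).group(0)
        if name not in seen:
            seen.add(name)
            dependency_tree(name, seen, depth + 1)

dependency_tree("requests")  # direct dependencies at depth 1, transitive ones deeper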

The Challenges of Dependency Management

Today there is an ever-increasing demand for both speed and innovation with regards to software development, and that is both the catalyst for, and the result of, open source software. This demand has also produced software delivery concepts like microservices and container orchestration that require vast amounts of integration points – all of which contribute to the chain of software dependencies. This has ushered in a host of software maintenance problems that require dependency management solutions.

The main challenges that arise are due to the pace of change. It is increasingly more difficult for organizations to keep up with evolving software, as well as the companies, communities, and licensing bodies that maintain and govern them. Some examples:

  • Version conflicts: When multiple dependencies within the same application require different versions of a shared library.
  • Compatibility issues: When updating a package can introduce breaking changes that require modifications to your application to maintain existing functionality.
  • Security vulnerabilities: When a downstream dependency has a known security defect that either needs to be addressed by your application or requires an update to the dependency to remediate it.
  • End-of-Life problems: When the referenced software package is no longer maintained by the vendor or community, which can result in security defects that are not remediated and leave your application vulnerable to attacks.
  • License compliance: When the application uses another software component in a way that is not allowed by the software license. This can sometimes happen as the result of a license change as versions of the dependency are upgraded.
  • Idle bloat: When an application has a growing number of unreferenced dependencies that increase the size, complexity, and liability without adding value.

Few developers are privy to all dependency management best practices, and most teams are not equipped with the tooling necessary to mount a proactive approach to avoid dependency problems. Gone are the days when a development team would settle on a single programming language that allowed them to use a particular package manager (e.g. python:pip, java:maven, javascript:npm, rpm:yum) to list the dependency tree, checklists to track the inventory, and unit tests to validate upgrades. Professionalizing a software development practice now requires modern systems for tackling software dependency management at scale.

Unbiased Guidance. SLA-Backed Support.

For more than two decades, OpenLogic has partnered with enterprises to help them get the most from their OSS. From migrations to technical support, we can tackle the toughest open source challenges — freeing up your team to focus on innovating for your business.

Let’s Talk

How to Track Software Dependencies and Manage Your Open Source Inventory

Unfortunately as of this writing, there is no silver bullet in this space. In fact, there is not even a best-in-class solution that has emerged. The good news is, there are software organizations and communities that recognize the problem, and are developing strong solutions to address pieces of this puzzle. Gluing them together can produce an effective system, which is the best path forward for now.

Software Dependency Management Tools

There are a few cornerstone tools that lay the foundation for a modern software dependency management system:

  • A central code repository that supports revision control and release versioning (e.g. Git, GitHub, GitLab, Helix Core). This is the foundation for dependency discovery, and it can also save and manage lock files that tie an application to a specific version of a dependency.
  • A package manager for each programming language or platform (e.g. python:pip, java:maven, javascript:npm, rpm:yum). These tools will handle the interactions (push, pull, install, update, list, etc.) with a dependency repository.
  • A Software Bill of Materials (SBOM) generator (Syft, SBOM Tool, Tern, CycloneDX). This will produce an attributed inventory of all the software components in your applications (including supplier name, component name, component author, component version, dependency relationship, governing license(s), etc.).
  • A vulnerability scanner that supports scheduled detection scans and notification schemes (e.g. Trivy, Grype). This tool will schedule automatic scans that identify security issues and provide detailed reports (i.e. risk prioritization, remediation guidance) that help assess the impact to all direct and transitive dependencies referenced by your application.

6 Dependency Management Best Practices

The tools above should be augmented by some best practices that can be implemented and enforced through internal policies, processes, and procedures. These six best practices are a good place to start:

  1. Create a central artifact repository to capture the software inventory with key attributes, notes, and links to additional details in related systems (e.g. roadmapping, issue tracking systems, risk management, contracts).
  2. Define a clear dependency policy that lists acceptable and unacceptable sources and specific approved lists of software components, along with guidelines for gaining approval for components that fill new needs.
  3. Establish update and upgrade policies that describe the tooling used to scan for dependency vulnerabilities and lifecycle attributes with guidance on how to prioritize, schedule, and apply/defer the scanner’s findings.
  4. Develop a training curriculum to educate developers and others in the organization on the need for ongoing diligence around dependency management and the topics, tools, and techniques required to deliver and maintain a healthy application.
  5. Adopt a versioning scheme (e.g. semantic versioning) that allows the organization to track the alignment of dependencies to a particular version of an internal application (a small illustration follows this list).
  6. Require formal code reviews and testing that includes a dependency review geared toward heading off the common challenges identified above (from version conflicts to idle bloat).
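As a small illustration of point 5, a check like the following (a sketch that assumes dependencies follow strict MAJOR.MINOR.PATCH numbering) can drive how an available upgrade is triaged:

def classify_upgrade(current, candidate):
    # Compare two semantic version strings and suggest how to handle the upgrade.
    cur = tuple(int(part) for part in current.split("."))
    new = tuple(int(part) for part in candidate.split("."))
    if new <= cur:
        return "no upgrade needed"
    if new[0] > cur[0]:
        return "major: may contain breaking changes; schedule a compatibility review"
    if new[1] > cur[1]:
        return "minor: backwards-compatible features; test and roll out"
    return "patch: bug/security fixes only; apply promptly"

print(classify_upgrade("2.4.1", "3.0.0"))  # major
print(classify_upgrade("2.4.1", "2.4.3"))  # patch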

Final Thoughts

Software has become progressively more complex and the need for speed has driven more code-sharing and reuse. Developers have to rely on available packages to handle solved problems, so they can focus on new challenges that advance their particular mission. And unfortunately, sometimes tracking all the dependencies in those packages gets lost in the DevOps shuffle. Hopefully, this blog offers some actionable steps to make your approach to dependency and open source inventory management a little more sophisticated.

 


Apache Spark vs. Hadoop: Key Differences and Use Cases

Apache Spark vs. Hadoop isn’t the 1:1 comparison that many seem to think it is. While they are both involved in processing and analyzing Big Data, Spark and Hadoop are actually used for different purposes. Depending on your Big Data strategy, it might make sense to use one over the other, or use them together.

In this blog, our expert breaks down the primary differences between Spark vs. Hadoop, considering factors like speed and scalability, and the ideal use cases for each.

 

What Is Apache Spark?

Apache Spark was developed in 2009 and then open sourced in 2010. It is now covered under the Apache License 2.0. Its foundational concept is a read-only set of data distributed over a cluster of machines, which is called a resilient distributed dataset (RDD).

RDDs were developed in response to limitations in the MapReduce computing model, which reads input from disk, maps a function over the data, and reduces the results back to disk. RDDs work faster on a working set of data stored in memory, which is ideal for real-time processing and analytics. When Spark processes data, the least recently used data is evicted from RAM to keep the memory footprint manageable, since disk access can be expensive.
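As a rough sketch of the RDD model (assuming a local Spark installation and the pyspark package), the following distributes a dataset across local cores, caches the working set in memory, and aggregates it in parallel:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

numbers = sc.parallelize(range(1, 1_000_001))   # an RDD spread across local cores
squares = numbers.map(lambda n: n * n).cache()  # cache keeps the working set in memory

print(squares.reduce(lambda a, b: a + b))       # sum of squares, computed in parallel
sc.stop()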

What Is Apache Hadoop?

Hadoop is a data-processing technology that uses a network of computers to solve large data computation via the MapReduce programming model.

Compared to Spark, Hadoop is a slightly older technology. Hadoop is also fault tolerant: it knows hardware failures can and will happen and adjusts accordingly. Hadoop splits the data across the cluster, and each node processes its portion in parallel, very similar to divide-and-conquer problem solving.

For managing and provisioning Hadoop clusters, the top two orchestration tools are Apache Ambari and Cloudera Manager. Most comparisons of Ambari vs. Cloudera Manager come down to the pros and cons of using open source or proprietary software.

Apache Spark vs. Hadoop at a Glance

The main difference between Apache Spark vs. Hadoop is that Spark is a real-time data analyzer, whereas Hadoop is a processing engine for very large data sets that do not fit in memory.

Hadoop can handle batching of sizable data proficiently, whereas Spark processes data in real-time such as streaming feeds from Facebook and Twitter/X. Spark has an interactive mode allowing the user more control during job runs. Spark is the faster option for ingesting real-time data, including unstructured data streams.

Hadoop is optimal for running analytics using SQL because of Hive, a data warehouse system that is built on top of Hadoop. Hive integrates with Hadoop by providing an SQL-like interface to query structured and unstructured data across a Hadoop cluster, abstracting away the complexity that would otherwise be required to write a Hadoop job to query the same dataset. Spark also has a similar interface, Spark SQL, which is part of the distribution and does not have to be added later.
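For example, a minimal Spark SQL session (assuming pyspark is installed) looks like the sketch below; the table and column names are invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

orders = spark.createDataFrame(
    [("alice", 120.0), ("bob", 80.0), ("alice", 45.5)],
    ["customer", "amount"],
)
orders.createOrReplaceTempView("orders")

# Query the DataFrame with plain SQL, much like querying a Hive table.
spark.sql("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer").show()
spark.stop()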

Get SLA-Backed Support for Hadoop or Spark

Managing a Big Data implementation can be challenging if you don’t have the right internal resources. Our Big Data experts can provide 24/7 technical support and professional services (upgrades, migrations, and more) so you can focus on leveraging the insights from your data.

Talk to a Big Data Expert

Spark vs. Hadoop: Key Differences

In this section, let’s compare the two technologies in a little more depth.

Ecosystem

The core computation engines of Hadoop and Spark differ in the way they process data. Hadoop uses a MapReduce paradigm that has a map phase to filter and sort data and a reduce phase for aggregating and summarizing data. MapReduce is disk-based, whereas Spark uses in-memory processing of Resilient Distributed Datasets (RDDs), which is great for iterative algorithms such as machine learning and graph processing.
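To see the two phases without a cluster, here is a toy word count in plain Python; a real Hadoop job distributes the same map and reduce steps across many nodes and shuffles intermediate results between them via disk:

from collections import defaultdict

documents = ["big data needs big storage", "spark and hadoop process big data"]

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 3, 'data': 2, ...}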

Hadoop comes with its own distributed storage system, the Hadoop Distributed File System (HDFS), which is designed for storing large files across a cluster of machines. Spark can use Hadoop’s HDFS as its primary storage system, but it also supports other storage systems like S3, Azure Blob Storage, Google Cloud Storage, Cassandra, and HBase.

Hadoop and Spark include various data processing APIs for different use cases. Spark Core provides functionality for Spark jobs like task scheduling, fault tolerance, and memory management. Spark SQL allows SQL-like queries on large datasets and integrates well with structured data. It supports querying both structured and semi-structured data. The Spark Streaming component provides real-time stream processing by dividing data streams into small batches. MLlib and GraphX are libraries for machine learning algorithms and graph processing, respectively, that run on Spark.

Hadoop includes MapReduce, which is the core API for data processing in Hadoop.  The following tools can be added to Hadoop for data processing:

  • Apache Hive is a data warehouse system built on top of Hadoop for querying and managing large datasets using a SQL-like language.

  • Apache HBase is a distributed NoSQL database that runs on top of HDFS and is used for real-time access to large datasets.

  • Apache Pig is a platform for analyzing large datasets that uses a scripting language (Pig Latin) to express data transformations.

For cluster management, YARN (Yet Another Resource Negotiator) is the most common way to run Spark applications transparently in tandem with Hadoop jobs in the same cluster, providing resource isolation, scalability, and centralized management.

Spark does have a few more cluster management configurations than Hadoop.  Apache Mesos is a distributed systems kernel that can run Spark, and Spark also has native support for Kubernetes, which can be used for containerized deployment and scaling capabilities in Spark clusters.

For fault tolerance, Hadoop has data block replication that ensures data accessibility if a node fails, and Spark uses RDDs to reconstruct data in the event of failure.

Real-time processing and machine learning are both included with Spark. Spark Streaming natively supports real-time data processing with low latency, but Hadoop requires tools like Apache Storm or Apache Flink to accomplish this task. MLlib is Spark’s machine learning library, and Apache Mahout can be used with Hadoop for machine learning.

Features

Hadoop has its own distributed file system, cluster manager, and data processing. In addition, it provides resource allocation and job scheduling as well as fault tolerance, flexibility, and ease of use.

Spark includes libraries for performing sophisticated analytics related to machine learning, AI, and a graphing engine. The scheduling implementation between Hadoop and Spark also differs. Spark provides a graphical view of where a job is currently running, has a more intuitive job scheduler, and includes a history server, which is a web interface to view job runs.

Performance and Cost Comparison

Hadoop accesses the disk frequently when processing data with MapReduce, which can yield a slower job run. In fact, Spark has been benchmarked to be up to 100 times faster than Hadoop for certain workloads.

However, because Spark does not access the disk as much, it relies on data being stored in memory. Consequently, this makes Spark more expensive due to memory requirements. Another factor that makes Hadoop more cost-effective is its scalability; Hadoop mixes nodes of varying specifications (e.g. CPU, RAM, and disk) to process a data set. Cheaper commodity hardware can be used with Hadoop.

Other Considerations

Hadoop requires additional tools for machine learning and streaming, which come included with Spark. Hadoop can also be very complex to use with its low-level APIs, while Spark abstracts away these details using high-level operators. Spark is generally considered to be more developer-friendly and easy to use.

Spark Use Cases

Spark is great for processing real-time, unstructured data from various sources such as IoT, sensors, or financial systems and using that for analytics. The analytics can be used to target groups for campaigns or machine learning. Spark has support for multiple languages like Java, Python, Scala, and R, which is helpful if a team already has experience in these languages.

Hadoop Use Cases

Hadoop is great for parallel processing of diverse sets of large amounts of data. There is no limit to the type and amount of data that can be stored in a Hadoop cluster; additional data nodes can be added as storage needs grow. It also integrates well with analytic tools like Apache Mahout, R, Python, MongoDB, HBase, and Pentaho.

It’s also worth noting that Hadoop is the foundation of Cloudera’s data platform, but organizations that want to go 100% open source with their Big Data management and have a little more control over where they host their data should consider the Hadoop Service Bundle as an alternative.

Using Hadoop and Spark Together

Using Hadoop and Spark together is a great way to build a powerful, flexible big data architecture. Typical use cases are large-scale ETL pipelines, data lakes and analytics, and machine learning. Hadoop’s scalable storage via HDFS can be used for storing large datasets and Spark can perform distributed data processing and analytics. Hadoop jobs can be used for large and long-running batch processes, and Spark can read data from HDFS and perform complex transformations, machine learning, or interactive SQL queries. Spark jobs can run on top of a Hadoop cluster using Hadoop YARN as the resource manager. This leverages both Hadoop’s storage and Spark’s faster processing, combining the strengths of both technologies.
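A minimal sketch of that combined workflow, with Hadoop providing storage (HDFS) and resource management (YARN) and Spark doing the processing; the paths, column names, and application name are illustrative assumptions:

```python
# Minimal sketch: read raw data from HDFS, transform it with Spark on YARN,
# and write curated results back to HDFS. All names/paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hdfs-etl-example")
    .master("yarn")
    .getOrCreate()
)

# Read raw data that a long-running Hadoop batch process landed in HDFS.
raw = spark.read.json("hdfs:///datalake/raw/transactions")

# Aggregate in Spark: total amount per account per day.
daily_totals = (
    raw.withColumn("day", F.to_date("timestamp"))
       .groupBy("day", "account_id")
       .agg(F.sum("amount").alias("total_amount"))
)

# Write the results back to HDFS for downstream analytics or BI tools.
daily_totals.write.mode("overwrite").parquet("hdfs:///datalake/curated/daily_totals")
```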

Final Thoughts

Organizations today have more data at their disposal than ever before, and both Hadoop and Spark have a solid future in the realm of open source Big Data infrastructure. Spark has a vibrant and active community, including 2,000 developers from thousands of companies, among them 80% of the Fortune 500.

For those thinking that Spark will replace Hadoop, it won’t. In fact, Hadoop adoption is increasing, especially in banking, entertainment, communication, healthcare, education, and government. It’s clear that there’s enough room for both to thrive, and plenty of use cases to go around for both of these open source technologies.

Editor’s Note: This blog was originally published in 2021 and was updated and expanded in 2025. 

 

About Perforce
The best run DevOps teams in the world choose Perforce. Perforce products are purpose-built to develop, build and maintain high-stakes applications. Companies can finally manage complexity, achieve speed without compromise, improve security and compliance, and run their DevOps toolchains with full integrity. With a global footprint spanning more than 80 countries and including over 75% of the Fortune 100, Perforce is trusted by the world’s leading brands to deliver solutions to even the toughest challenges. Accelerate technology delivery, with no shortcuts.

About Version 2 Digital

Version 2 Digital is one of the most dynamic IT companies in Asia. The company distributes a wide range of IT products across various areas including cyber security, cloud, data protection, end points, infrastructures, system monitoring, storage, networking, business productivity and communication products.

Through an extensive network of channels, point of sales, resellers, and partnership companies, Version 2 offers quality products and services which are highly acclaimed in the market. Its customers cover a wide spectrum which include Global 1000 enterprises, regional listed companies, different vertical industries, public utilities, Government, a vast number of successful SMEs, and consumers in various Asian cities.

Open Source Trends and Predictions for 2025

It’s a new year, which is a good time to reflect on what changed in the never-boring OSS world over the past 12 months — and what 2025 might bring. Read on to see what I expect we’ll be hearing and reading about this year in terms of open source trends.

 

 

Demand for More Data Sovereignty

More and more organizations are streaming and processing large data sets in real time, for reasons ranging from observability into manufacturing processes and sentiment analysis of social media, to routing and processing financial transactions and training Large Language Models for AI applications.

Big Data technologies are complex, often requiring both specialized IT operations teams as well as infrastructure architects. As a result, many companies have turned to managed solutions in order to offload this work so their own teams can focus on the data and data analysis itself. However, many of these managed solutions have started adding non-optional features, requiring public cloud deployment, and dramatically increasing their pricing structure, often without transparency to their customers. Additionally, customers are running into compliance issues, as new regulatory requirements mandating how and where data is processed and stored are sometimes incompatible with these platforms.

Since many of these solutions are based on existing OSS technologies such as Hadoop, Kafka, and others, we expect to see companies rethinking their Big Data strategy, looking for ways to achieve data sovereignty by bringing their Big Data solutions in-house with open source software, and partnering with commercial support vendors as needed to aid in architecture and management.

Related >> Is It Time to Open Source Your Big Data Management? 

The Search for the Next CentOS Continues

On June 30, 2024, we saw a milestone in the Enterprise Linux ecosystem as CentOS 7 reached end of life. While a number of commercial offerings emerged to allow CentOS users to postpone their migrations, these are short-term solutions, and eventually companies will need to migrate to new distributions.

As CentOS was itself a 1-to-1 replacement for Red Hat Enterprise Linux (RHEL), this of course remains an option. However, this ignores one of the main reasons for using CentOS: the fact that you could use it without support contracts, or contract with third parties for support, often at steep discounts over Red Hat.

Several CentOS alternatives have emerged in the past few years, including AlmaLinux and Rocky Linux, providing essentially the same 1:1 OSS counterpart to RHEL that CentOS provided. Like CentOS, these distros are community-supported, and both are relatively new, with an unproven track record of support that makes some enterprise organizations nervous.

Additionally, many businesses have become increasingly security-minded in the last few years, due to a variety of CVE announcements against OSS software as well as supply chain attacks. A freely available Linux distribution is often not enough for these companies; they also need a secure baseline image to start from in order to streamline the security measures they need to take to protect their software. While commercial solutions such as RHEL, Oracle Linux, and SUSE Linux provide these, they come at substantial cost.

All of which is to say, there is still no clear victor in the so-called “Linux Wars,” but as more companies migrate off CentOS in 2025, we’ll probably have a better sense of whether security or cost-effectiveness is the bigger driver, based on where they end up.

Related >> How to Find the Best Linux Distro For Your Organization

Open Source AI Enters the Next Phase

AI has become the technology du jour, replacing previously trending topics such as the metaverse and cryptocurrency. Technically speaking, most of the technology around AI today centers on Large Language Models (LLMs) and Generative AI, which use statistical models to determine what to do next, whether that’s completing a conversational prompt, splicing together images, or other use cases.

Generative AI models require extensive training on large amounts of data, which means they fall under the umbrella of Big Data when it comes to open source. The need to keep these processes and technologies secure and performant is paramount, and, just as with Big Data, the available expertise is spread thin.

AI is a hugely competitive market and that’s not going to change in 2025. There are a variety of toolchains already available for training LLMs and other models within Big Data pipelines, with tools such as Apache Spark, Apache Kafka, and Apache Cassandra providing key functionality used to train these models. I anticipate seeing more companies developing bespoke LLMs that directly support the products they produce, and they will use open source toolchains to do this.

Related >> Open Source and AI: Using Cassandra, Kafka, and Spark for AI Use Cases

Lessons From the XZ Utils Backdoor

In 2024, the security world was rocked by the discovery of a malicious backdoor in the xz utility, and attention was turned to staving off future supply chain attacks.

Supply chain attacks? But isn’t xz an open source utility?

In this particular case, an individual used social engineering to gradually, over multiple years, take over maintenance of the open source project that produces xz. Once they had, they slipped the backdoor into a release that they signed.

While many tried to decry this incident as evidence that open source software is inherently insecure (as this sort of social engineering is always a possibility), there’s another side to the coin: it was an open source packager performing standard benchmarking on a development release of an operating system who uncovered the issue. As the adage goes, given enough eyeballs, all bugs are shallow.

One side effect of this attack was renewed interest in Software Bills of Materials (SBOMs). Organizations that are able to produce an SBOM for their software have a record of what they have installed, including the specific versions, as well as which licenses apply. This provides the ability to audit your software — or your vendor’s software — for known security vulnerabilities, and to react to them more quickly. Many organizations are forming DevSecOps teams to manage building, maintaining, and validating SBOMs against vulnerability lists as part of ongoing defense-in-depth security efforts.
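As a minimal sketch of that kind of audit, the snippet below reads a CycloneDX-style JSON SBOM and flags components that appear on a watchlist of known-bad versions. The file name and the watchlist entries are illustrative assumptions, not real advisories or a complete vulnerability feed:

```python
# Minimal sketch: check components in a CycloneDX-style JSON SBOM against a
# small watchlist of (name, version) pairs. All inputs here are hypothetical.
import json

KNOWN_BAD = {("xz-utils", "5.6.0"), ("xz-utils", "5.6.1")}  # illustrative watchlist

with open("sbom.json") as f:                 # hypothetical SBOM file
    sbom = json.load(f)

for component in sbom.get("components", []):
    name, version = component.get("name"), component.get("version")
    if (name, version) in KNOWN_BAD:
        print(f"FLAGGED: {name} {version} appears on the vulnerability watchlist")
    else:
        print(f"ok: {name} {version}")
```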

Even better, the OSS community is stepping up to build SBOM generation into its development toolchains and utilities. The Node.js community has several projects that will produce SBOMs from application manifests; PHP’s Composer project added these capabilities; and Java’s Maven and Gradle each have plugins to generate SBOMs.

Security is and will continue to be a top concern for companies using open source software, and in 2024, we saw proof that the ecosystem is helping protect them. Whether or not we will have another zero-day attack in 2025 remains to be seen, but companies are recognizing the benefit of being more proactive by embedding security best practices into their development and operations workflows and managing OSS inventory with the assistance of tools like SBOMs.

Support Your Entire Open Source Stack

Companies around the world trust OpenLogic to provide expert technical support for the open source technologies in their infrastructure, including LTS for EOL software. Let our enterprise architects tackle the toughest challenges so your developers can focus on what matters to your business.

Explore solutions 
