As in, when I watch YouTube tutorials, I often see YouTubers with a small widget on their desktop giving them an overview of their RAM usage, security level, etc. What apps do you all use to track this?
Netdata, monitoring a few thousand servers (virtual) that way.
The fastest way? Probably netdata
I’ll look into this too. Thank you.
agreed … BY FAR the fastest. Easiest learning curve as well
This. If you have more servers, you can also connect them all to a single UI where you can see all the info at once, using Netdata Cloud.
Just set this up yesterday. I used a parent node and then have all my VMs point to that. Took like an hour to figure it out.
Hey, did you use the cloud functionality or not? I’m trying to go all local with a parent-child kind of setup, but so far I haven’t been able to.
I don’t know if I’ll keep running this. The child nodes are already complaining about increased write delays since I installed the agents on them.
The parent is still visible to the cloud portal. My understanding is that the data all resides locally, but when you log in to their cloud portal, it connects to the parent to display the information. I’m still playing with it to confirm. My parent node shows all the child nodes on the local interface, but the cloud still shows them all too.
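For the all-local parent/child question above, here is a rough sketch of the two `stream.conf` fragments involved, rendered from Python only so the shared key stays identical on both sides. The section and option names (`[stream]`, `enabled`, `destination`, `api key`) are how I remember Netdata's streaming docs, and the parent address `192.0.2.10:19999` is a made-up placeholder, so check all of it against the docs for your version before relying on it.

```python
import uuid


def render_stream_conf(parent_addr: str, api_key: str) -> tuple[str, str]:
    """Return (child_conf, parent_conf) text for a parent/child streaming pair.

    Assumption: option names follow Netdata's documented stream.conf layout.
    """
    # The child pushes its metrics to the parent and authenticates with the key.
    child = (
        "[stream]\n"
        "    enabled = yes\n"
        f"    destination = {parent_addr}\n"
        f"    api key = {api_key}\n"
    )
    # The parent accepts children that present this key (section named by the key).
    parent = (
        f"[{api_key}]\n"
        "    enabled = yes\n"
    )
    return child, parent


if __name__ == "__main__":
    key = str(uuid.uuid4())  # any UUID works as the shared key
    child_conf, parent_conf = render_stream_conf("192.0.2.10:19999", key)
    print("# child stream.conf\n" + child_conf)
    print("# parent stream.conf\n" + parent_conf)
```

Nothing in this fragment touches the cloud; whether the agents also show up in the cloud portal is a separate setting.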
It is a bit difficult at the start, but in the end you really can monitor and get notified about anything that’s happening on your system.
Checkmk (Raw - free version.) Some setup aspects are a bit annoying (wants to monitor every last ZFS dataset and takes too long to ‘ignore’ them one by one.) It does alert me to things that could cause issues, like the boot partition almost full. I run it in a Docker container on my (primarily) file server.
I use this as well! Works well and has built in intelligence for thresholds.
When the fan gets loud enough to hear, I’ll check it :P
I recommend Checkmk. https://checkmk.com/
No… Why? It’s old, it’s trash. Might get hate for it, but it’s just not good.
I second CMK.
A TICK stack is unwieldy, Grafana takes a lot of setup, and all of this assumes you both know what to monitor and get stats on it.
CMK by contrast is plug and play. Install the server on a VM or host, install the agent on your other systems, and you’re good to go.
I’m running a TICK stack with a couple thousand servers - way less CPU usage than checkmk/nagios or anything else from the previous millennium …
How do you solve the problem of runaway memory usage? Even monitoring a few dozen hosts, memory usage would grow to many GB and continue to grow indefinitely until it OOM’d, and from my reading Influx has no way to prevent this.
Have you had runaway memory problems with influx, or your apps?
Specifically with Influx.
InfluxDB metrics server and Telegraf agent to collect metrics
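If it helps to see what actually travels from the agent to the database: Telegraf sends points in InfluxDB line protocol (`measurement,tags fields timestamp`). Below is a minimal sketch of a formatter for float fields only. The function name is my own, and the `cpu` / `host` / `usage_idle` names in the usage line are just illustrative values. In practice Telegraf does all of this for you.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol (float fields only)."""
    if not fields:
        raise ValueError("a point needs at least one field")

    def esc_key(value):
        # Tag keys, tag values and field keys escape commas, spaces and '='.
        return (str(value).replace(",", r"\,")
                          .replace(" ", r"\ ")
                          .replace("=", r"\="))

    # Measurement names escape commas and spaces only.
    name = str(measurement).replace(",", r"\,").replace(" ", r"\ ")
    # Sorting tags by key is recommended for write performance.
    tag_part = "".join(f",{esc_key(k)}={esc_key(v)}"
                       for k, v in sorted(tags.items()))
    field_part = ",".join(f"{esc_key(k)}={float(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line_protocol("cpu", {"host": "nas"}, {"usage_idle": 97.5},
                           1700000000000000000))
```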
I use sar for historical data, my own scripts running under cron on the hosts for the specific things I want to keep an eye on, and my own scripts under cron on my monitoring machines to alert me when something’s wrong. I don’t use a dashboard.
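For anyone wondering what a cron-driven check like that can look like, here is a minimal sketch, not the poster's actual scripts. It checks root filesystem usage and the 1-minute load average and prints a line only when a threshold is crossed. The thresholds (90% disk, 1.5 load per CPU) and the function names are my own assumptions. It relies on cron mailing any output to the configured `MAILTO` address, so staying silent means "all good".

```python
import os
import shutil


def find_problems(disk_used_pct, load_1m, cpu_count,
                  disk_limit_pct=90.0, load_per_cpu_limit=1.5):
    """Return a list of human-readable problems; empty means healthy."""
    problems = []
    if disk_used_pct >= disk_limit_pct:
        problems.append(
            f"disk {disk_used_pct:.0f}% full (limit {disk_limit_pct:.0f}%)")
    if load_1m / cpu_count >= load_per_cpu_limit:
        problems.append(f"load {load_1m:.2f} on {cpu_count} CPUs")
    return problems


def main():
    usage = shutil.disk_usage("/")
    used_pct = 100.0 * usage.used / usage.total
    load_1m = os.getloadavg()[0]  # Unix only
    # Print only when something is wrong, so cron only sends mail on problems.
    for problem in find_problems(used_pct, load_1m, os.cpu_count() or 1):
        print(problem)


if __name__ == "__main__":
    main()
```

Keeping the threshold logic in `find_problems` separate from the system calls makes it easy to test without a nearly full disk.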
I know that it needs a fix when my dad complains that he can’t watch TV and the rolling door doesn’t open in the morning.
Zabbix. Also, for Windows, it could be Rainmeter https://www.rainmeter.net/ or HWiNFO https://www.hwinfo.com/. For Linux, Conky.
I use Telegraf + InfluxDB + Grafana for monitoring my home network and systems. Grafana has a learning curve for building panels and dashboards, but is incredibly flexible. I use it for more than server performance. I have a dual-monitor “kiosk” (old Mac mini) in my office displaying two Grafana dashboards. These are:
Network/Power/Storage showing:
- firewall block events & sources for last 12 hrs (from pfSense via Elasticsearch),
- current UPS statuses and power usage for last 12 hrs (Telegraf apcupsd plugin -> InfluxDB),
- WAN traffic for last 12 hrs (from pfSense via Telegraf -> InfluxDB),
- current DHCP clients (custom Python script -> MySQL), and
- current drive and RAID pool health (custom Python scripts -> MySQL)
Server sensors and performance showing:
- current status of important cron jobs (using Healthchecks -> Prometheus),
- current server CPU usage and temps, and memory usage (Telegraf -> InfluxDB),
- server host CPU usage and temps, and memory usage for last 3 hrs (Telegraf -> InfluxDB),
- Proxmox VM CPU and memory usage for last 3 hrs (Proxmox -> InfluxDB), and
- Docker container CPU and memory usage for last 3 hrs (Telegraf Docker plugin -> InfluxDB)
Netdata works really well for system performance for Linux and can be installed from the default repositories of major distributions.
Network/Power/Storage
Pretty cool dashboards. I liked the DHCP clients info, does it also report DHCP reservations?
Where do you do DHCP, on the pfSense or somewhere else?
does it also report DHCP reservations?
Thanks, and yes, entries with Type “static” are DHCP reservations.
Where do you do DHCP, on the PFSense or somewhere else?
Yes, on pfSense. I use the Python function from pletch/scrape_pfsense_dhcp_leases.py (on GitHub), which scrapes the pfSense status_dhcp_leases.php page. Then I added my own function that queries my TP-Link APs over SNMP to determine which AP a wireless DHCP client is connected to.
I can throw the script up on Dropbox if you are interested. I am mediocre at writing Python, so it is pretty specific to my environment.
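On the reservations point, a tiny sketch of separating reservations from dynamic leases once the lease page has been scraped. This is not the script mentioned above. It assumes each lease is already a dict with a `type` key (the key names, IPs and hostnames here are hypothetical) and uses only the fact from this thread that type "static" means a reservation.

```python
def split_leases(leases):
    """Split scraped lease records into (reservations, dynamic_leases)."""
    reservations, dynamic = [], []
    for lease in leases:
        # Per the thread: type "static" marks a DHCP reservation.
        if str(lease.get("type", "")).strip().lower() == "static":
            reservations.append(lease)
        else:
            dynamic.append(lease)
    return reservations, dynamic


if __name__ == "__main__":
    sample = [  # made-up records for illustration
        {"ip": "192.0.2.20", "hostname": "nas", "type": "static"},
        {"ip": "192.0.2.101", "hostname": "phone", "type": "active"},
    ]
    reserved, dynamic = split_leases(sample)
    print(len(reserved), "reservations,", len(dynamic), "dynamic leases")
```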
We use Zabbix here. Zabbix is amazing, and we put it in all of our templates so any new servers and hosts pop up on the Zabbix dashboard preconfigured, just like that. For logs and security we use an Elastic “ELK stack”, which gives us a heads-up if anything is wrong in the logs, and Zabbix gives us a heads-up on overall system health. Our health monitor panel combines the two windows so we can see full server health and any problems right there as a to-do list for the IT team.
I get ahead of it by getting extra.
Need 16 GB of RAM and 8 cores? Well, let me add 64 GB to my cart and a 12-core CPU.
Hasn’t failed me
I use Home Assistant already. They have a plugin for Glances. I guess all I’m interested in is CPU temp and load. Any changes = something’s up.
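The "any changes = something's up" idea can be made a bit less jumpy by comparing the current reading against a recent baseline. Here is a minimal sketch, unrelated to how Home Assistant or Glances work internally. The function name, the 3-sigma rule and the minimum-spread floor are my own assumptions, and it works the same for load or temperature readings.

```python
from statistics import mean, pstdev


def is_unusual(history, current, min_sigma=3.0, min_spread=0.05):
    """Return True if `current` is far from the recent baseline in `history`."""
    if len(history) < 5:
        return False  # not enough data to call anything unusual
    baseline = mean(history)
    # Floor the spread so a perfectly flat history doesn't flag tiny wiggles.
    spread = max(pstdev(history), min_spread)
    return abs(current - baseline) > min_sigma * spread


if __name__ == "__main__":
    recent_load = [0.20, 0.25, 0.30, 0.20, 0.25, 0.30]
    print(is_unusual(recent_load, 0.30))
    print(is_unusual(recent_load, 2.00))
```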
CheckMK for general monitoring, Grafana/Prometheus for Proxmox-cluster, Wazuh for IDS-purposes and UptimeKuma for general uptime on services. It’s not like it’s necessary, but it’s nice to tinker in my homelab before implementing the same services on a “professional level” at work.
My HomeAssistant is stable, so wifey is not being used as a monitor ;-)