Difference between revisions of "Management"

From CSLabsWiki
 
I am thinking of moving this from a VM onto a piece of hardware that is independent of LDAP (as it is now, though I am still developing it), with a local sign-on, so that it can manage system statuses without interference from major server downtime, which it is designed to track and not be affected by. The idea is that it keeps running even when everything else has crashed for one reason or another: if the server room is on fire, this will tell us when nothing else works, so long as there is Ethernet to broadcast over.
 
 
The main reason I am thinking of a single board computer is that it can easily be battery powered during power outages, so it can manage the battery backups without wasting power in a transformer. It is also much safer to maintain, and we know exactly how it will perform in an outage.
 
 
Having it in hardware at all allows us to interface to hardware such as battery backup cables and also to be able to take temperature readings of the room itself.
 
 
I would like to use either a pcDuino 3 Nano Lite or a Raspberry Pi. I have a pcDuino and it works rather well: low cost, good specs, and GigE on its own bus, which is a good plus because it doesn't crash when USB is pegged, unlike the Raspberry Pi. I would rather use the pcDuino due to the better specs and the GigE. Either board would do the trick though.
 
 
If it does use a battery, I think I would use this https://learn.adafruit.com/adafruit-powerboost/pinouts along with an RC circuit that gives the pcDuino a few seconds to shut down before a power transistor cuts it off, so that nothing draws power once the battery is low, which preserves the LiPo that would be used. I would expect the pcDuino or Raspberry Pi to draw about 800mA to 1200mA, depending on what is connected to the USB ports.
 
 
Ideally, a non-powered hub connects all of the battery backups together, and the SBC reads the charge level and serial number of each UPS in order to associate the correct device with the correct servers. This way, when the power goes out, we immediately shut down non-essential hardware, and when the UPSes go low, we begin to shut down everything in a controlled manner, without any bad halts that might cause disk inconsistencies.
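
The shutdown policy described above could be sketched roughly as follows. This is only an illustration of the decision logic: the function name decide_action, the LOW_CHARGE threshold of 25 percent, and the idea of passing the UPS state in as plain arguments are all assumptions, and reading the real charge level from a UPS is not shown.

```shell
#!/bin/bash
# Hypothetical threshold: percent charge at or below which everything shuts down.
LOW_CHARGE=25

# decide_action ON_BATTERY CHARGE_PERCENT ESSENTIAL
#   ON_BATTERY: 1 if the UPS is running on battery, 0 if on utility power
#   ESSENTIAL:  1 if the server is essential, 0 otherwise
# Prints "shutdown" or "keep".
decide_action() {
    local on_battery=$1 charge=$2 essential=$3
    if [ "$on_battery" -eq 0 ]; then
        echo keep        # utility power is present, nothing to do
    elif [ "$essential" -eq 0 ]; then
        echo shutdown    # outage: non-essential hardware goes first
    elif [ "$charge" -le "$LOW_CHARGE" ]; then
        echo shutdown    # UPS is low: bring essential servers down cleanly
    else
        echo keep        # essential server and the UPS still has charge
    fi
}

decide_action 1 80 0   # non-essential server during an outage
decide_action 1 80 1   # essential server, UPS still charged
decide_action 1 20 1   # essential server, UPS low
```

Keeping the decision in one small function makes it easy to check the policy by hand before it is ever wired to real hardware.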
 
 
The battery charge circuit, which would also power the board when on utility power, would be surge protected and would shut off immediately when the power begins to fluctuate or do strange things. This could use a relay, with some diodes for voltage protection on a detection line, so that a surge doesn't destroy the voltage regulation electronics we create.
 
 
This is a general idea, and I write my ideas only to share them; they are not firm in any way. I ask that others help me select hardware so that we make good decisions.
 

Revision as of 03:48, 23 February 2016

Management
IP Address(es): 128.153.145.62
Contact Person: Jared Dunbar
Last Update: February 2016
Services: server status indicator


Hostname: management
Operating system: Armbian (Debian) Jessie (kernel 4.4.1-sunxi)
NIC 1: eth0
MAC: 02:8e:08:41:65:6a
IP: 128.153.145.62
CPU: Hard Float Dual Core Allwinner A20 armv7l, Mali 400 MP2
RAM: 1GB DDR3 ECC


Management is an SBC (single board computer) used for monitoring the status of VMs on other machines and the status of the hardware in the server room, i.e. checking the CPU, RAM, and hard drive stats, among other configurable things.

Each computer in the server room that is assigned to this list runs a set of startup programs (Bash scripts and C/C++ executables) that periodically send data to Management. Management shows the data on an uptime web page, which makes it easy to check system uptime and service uptime, among other things.

Currently installed on the machine are the following:

htop openssh-client vim openjdk-7-jdk p7zip-full g++ sudo git

Client Side (runs on a server)

Requirements

g++ top awk tail bash sed free

The source code for the client executable is available online at https://github.com/jrddunbr/management-client

The bash scripts are written as needed. The system is expandable: each data parameter is stored as a key, and each server can theoretically have as many keys as it wants. Here are some working examples:

CPU:

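#!/bin/bash
# Take two top samples and keep the second; the first reports averages since boot.
DATA=$(
top -bn2 | \
grep "Cpu(s)" | \
sed -n '1!p' | \
sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | \
awk '{print 100 - $1}')   # CPU usage = 100 - idle percentage
echo "$DATA"
/manage/management-client 128.153.145.62 80 cpu "$DATA"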

Free-Ram:

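#!/bin/bash
# Reports total minus column 7 of the "Mem:" row. Column 7 is "available" in
# newer versions of free; older versions print "cached" there instead.
DATA=$(free -m | awk 'NR==2{printf "%sMB\n", $2-$7 }')
echo "$DATA"
/manage/management-client 128.153.145.62 80 used-ram "$DATA"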

Total-Ram:

#!/bin/bash
# Field 2 of the "Mem:" row is the total RAM in MB.
DATA=$(free -m | awk '/^Mem/ {print $2 "MB"}')
echo "$DATA"
/manage/management-client 128.153.145.62 80 total-ram "$DATA"

Uptime:

#!/bin/bash
# Replace spaces with underscores so the value is sent as a single word.
DATA=$(uptime -p | sed -e 's/ /_/g')
echo "$DATA"
/manage/management-client 128.153.145.62 80 uptime "$DATA"
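
Hard drive stats are mentioned above but have no example script. A minimal sketch in the same style follows; the key name disk-used and the helper disk_used are made up for illustration, and the script only calls the client if it is actually installed at /manage/management-client.

```shell
#!/bin/bash
# disk_used MOUNTPOINT: print the used percentage of a filesystem, e.g. "42%".
disk_used() {
    # -P forces the POSIX layout with one line per filesystem;
    # column 5 of that layout is the capacity in percent.
    df -P "$1" | awk 'NR==2 {print $5}'
}

DATA=$(disk_used /)
echo "$DATA"

# "disk-used" is a hypothetical key name; any key name works the same way.
if [ -x /manage/management-client ]; then
    /manage/management-client 128.153.145.62 80 disk-used "$DATA"
fi
```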

Compiling Management

These scripts expect management-client.cpp to be compiled as

g++ management-client.cpp -o management-client -std=c++11

and to be in the /manage folder (for simplicity, I tend to put them all in the same folder).

Startup

I also have one Bash script that runs all of the client scripts. It looks a lot like this:

Bash Start Script

/manage/run.sh

#!/bin/bash
cd /manage
while true
do
    /manage/cpu.sh &
    /manage/free-ram.sh &
    /manage/total-ram.sh &
    /manage/uptime.sh &
    sleep 20
done
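
As the number of key scripts grows, listing each one by hand gets tedious. One possible variation, sketched here under the assumption that every executable .sh file in the folder other than run.sh is a key script, is to discover them instead. The function name run_all is made up for this example.

```shell
#!/bin/bash
# run_all DIR: start every executable *.sh in DIR in the background,
# skipping run.sh itself, then wait for all of them to finish.
run_all() {
    local dir=$1 script
    for script in "$dir"/*.sh; do
        [ -x "$script" ] || continue    # also skips an unmatched glob
        [ "$(basename "$script")" = "run.sh" ] && continue
        "$script" &
    done
    wait
}
```

The body of the loop in /manage/run.sh would then be run_all /manage followed by sleep 20. Note that, unlike the listing above, this waits for the scripts to finish before sleeping, so the period becomes the slowest script's run time plus 20 seconds.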

Systemd Start Script

/etc/systemd/system/manage.service:

[Unit]
Description=manage stuff

[Service]
ExecStart=/bin/bash /manage/run.sh

[Install]
WantedBy=multi-user.target

Extensibility

It is easy to write more bash scripts that report other values. The compiled client expects its arguments as ./management-client (IP) (PORT) (KEY_NAME) (VALUE); running it sends the key and its value to the server. When the server receives this REST call, it accepts it because the sender is in the 145 subnet, and then stores the value in the program's data structures.
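
To avoid repeating the IP and port in every script, the call can be wrapped in a small helper. The sketch below is one way to do it, not part of the client itself: send_key and clean_value are hypothetical names, the key name kernel is only an example, and replacing spaces with underscores simply mirrors what uptime.sh does. When the client is not installed, the helper prints the command instead of running it.

```shell
#!/bin/bash
CLIENT=/manage/management-client   # path used elsewhere on this page
SERVER_IP=128.153.145.62
SERVER_PORT=80

# clean_value TEXT: replace spaces with underscores, as uptime.sh does.
clean_value() {
    printf '%s\n' "${1// /_}"
}

# send_key KEY VALUE: send one key/value pair to Management.
send_key() {
    local key=$1 value
    value=$(clean_value "$2")
    if [ -x "$CLIENT" ]; then
        "$CLIENT" "$SERVER_IP" "$SERVER_PORT" "$key" "$value"
    else
        # Client not installed: show what would have been run.
        echo "$CLIENT $SERVER_IP $SERVER_PORT $key $value"
    fi
}

# Example: report the running kernel version under a made-up key name.
send_key kernel "$(uname -r)"
```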

Unfortunately, for the time being the 145 subnet is hard-coded. In future releases, as I have more time to work on this, the subnet will become configurable and more features will be added.

Server Side (management itself)

Requirements

The server side of the software is available at https://github.com/jrddunbr/management-server and is still a work in progress.

It requires the following to be installed:

openjdk-7-jdk wget

Setup

Place the compiled .jar file in a convenient location along with the following files (most are in the GitHub repo as examples):

Configuration

index.html # a template HTML file that is used to list all of the servers, uptimes, and other data.
server.html # a template HTML file that is used to list one server and all of its associated key and value pairs.
templates.yml # a template YAML file that is used to create all of the server-specific YAML files. Once these are made, they appear in the servers folder in the directory the jar is run from.
master.yml # a file that defines master keys, which are server-side keys that define server characteristics locally. They are used to enable servers and to specify whether they are urgent to server uptime, and in the future will record the maintainers and, for a VM, the VM-host operator.

Create a ./servers folder. The jar will crash without it. This will be fixed soon.

Inside the servers folder are the per-server configuration files.

Make sure that your YAML files parse properly, or I guarantee that the Java code will crash. There are a few good online checkers out there.
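
One common cause of YAML parse failures is a tab used for indentation, which YAML does not allow. The following is a crude pre-check for that one mistake only, not a YAML parser; the function name has_tab_indent is made up, and the assumption that the per-server files end in .yml is mine.

```shell
#!/bin/bash
# has_tab_indent FILE: succeed if any line of FILE is indented with a tab
# (optional leading spaces followed by a tab character).
has_tab_indent() {
    grep -q "^ *$(printf '\t')" "$1"
}

# Check the configuration files in the current directory, if present.
for f in master.yml templates.yml servers/*.yml; do
    [ -f "$f" ] || continue
    if has_tab_indent "$f"; then
        echo "$f: tab used for indentation"
    fi
done
```

A file that passes this check can still be invalid YAML, so a real checker is still worth running.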

Startup

I made the startup script for the management server much the same as the client one. In fact, I only changed the path to point at an executable SH file and changed the description slightly.

The SH file that starts the server is as follows:

#!/bin/bash
cd /manage
# Record each start time, then append the server's output to the same log.
date >> runtime.log
java -jar management-server.jar >> runtime.log
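
Since the jar crashes without a servers folder, the start script could check the layout first. This is a sketch of that idea only; prepare_dir is a hypothetical helper, and the file list is taken from the Configuration section above.

```shell
#!/bin/bash
# prepare_dir DIR: create DIR/servers if needed and report any missing
# configuration files. Returns non-zero if a file is missing.
prepare_dir() {
    local dir=$1 f missing=0
    mkdir -p "$dir/servers"
    for f in index.html server.html templates.yml master.yml; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Intended use in the start script (not run here):
#   prepare_dir /manage && java -jar management-server.jar >> runtime.log
```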

Other Notes:

Downsides (pending improvements)

One downside of the whole system is that it depends on Talos's HTTPS server running when Management starts, because it fetches the domain files from it. It can use a fallback mechanism in which it copies the file to the local disk as a backup, and you could technically put the file there yourself for it to read. A new configuration key needs to be added to the master list before this will work, however; that is coming soon (there's a GitHub fork called sans-talos).

There's also an error or two in the RAM information that I have been collecting, and I would like to also connect to the temperature sensors of each machine, which will likely require configuring a script for each sensor in each machine.

Systemd Helpful Tips

systemctl enable <name> for enabling units

systemctl disable <name> for disabling

systemctl start <name> for starting

systemctl stop <name> for stopping

systemctl status <name> for checking a unit's status.

Hardware Implementation:

Fetch Armbian Jessie for the pcDuino 3. It's OK that it's not the Nano Lite version, even though we are currently using a pcDuino 3 Nano Lite.

Flash that to the SD card, log in as root, set the root password, and then run the reboot command. Wait for it to restart, and then reboot again.

At this point, the system has set up the SSH server and expanded / to the full size of the SD card (up to 32GB).

Now install the following packages:

htop openssh-client vim openjdk-7-jdk p7zip-full g++ sudo git

And now edit some files so that they contain the following contents:

vim /etc/hostname

management

vim /etc/network/interfaces

# Wired adapter #1
auto eth0
iface eth0 inet static
    address 128.153.145.62
    netmask 255.255.254.0
    gateway 128.153.145.1

# Local loopback
auto lo
iface lo inet loopback

Then edit the sshd config to use the default COSI SSH port:

vim /etc/ssh/sshd_config

Set the line that begins with Port to that port number.

After you have done that, reboot.


Now follow the standard instructions above for setting up the software itself.

(Deprecated)

temperature.c:

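#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <fcntl.h>    /* open */
#include <unistd.h>   /* read, close */
#include <math.h>     /* log */
#define ADEV "/proc/adc2"

int main(void) {
  int aPin = open(ADEV, O_RDONLY);
  if (aPin < 0) {
    perror(ADEV);
    return 1;
  }
  char tBuf[16] = {0};
  /* Leave room for the terminating NUL that atoi needs. */
  ssize_t ret = read(aPin, tBuf, sizeof(tBuf) - 1);
  close(aPin);
  if (ret <= 5) return 1;
  /* The number starts after the first five characters of the reading. */
  int aVal = atoi(tBuf + 5);
  /* Values outside this range would divide by zero or take log of <= 0. */
  if (aVal <= 0 || aVal >= 4096) return 1;
  double temp = 0.0;
  /* Thermistor resistance from the ADC reading, then its natural log. */
  temp = log(10000.0*((4096.0/(double)aVal - 1.0)));
  /* Convert log-resistance to Kelvin. */
  temp = 1 / (0.001129148 + (0.000234125 + (0.0000000876741 * temp * temp)) * temp);
  temp = temp - 273.15;             /* Kelvin to Celsius */
  temp = (temp * 9.0)/5.0 + 32.0;   /* Celsius to Fahrenheit */
  printf("%.0f\n", temp);
  return 0;
}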

compile that with

gcc temperature.c -o temp -lm
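
For reference, the arithmetic in temperature.c can be written out as follows. The code does not name the formula; the form matches the Steinhart-Hart thermistor equation, and reading the constants 10000 and 4096 as a 10 kΩ reference resistance and a 12-bit ADC range is an inference from the code, not something stated on this page. Here a is the raw ADC value.

```latex
\begin{align*}
R &= 10000\left(\frac{4096}{a} - 1\right) \\
\frac{1}{T_K} &= A + B\ln R + C(\ln R)^3,
  \qquad A = 1.129148\times10^{-3},\;
         B = 2.34125\times10^{-4},\;
         C = 8.76741\times10^{-8} \\
T_C &= T_K - 273.15 \\
T_F &= \tfrac{9}{5}\,T_C + 32
\end{align*}
```

As a check, a reading of a = 2048 gives R = 10000 and ln R ≈ 9.2103, so 1/T_K ≈ 0.0033540, which is T_K ≈ 298.15 K, or about 25 °C (77 °F).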

temp.sh:

#!/bin/bash
TEMP=$(./temp)"F"
echo $TEMP
/manage/management-client 128.153.145.62 80 temp $TEMP

and make Management (and other scripts) run with the following init script:

/etc/init.d/manageserver

#!/bin/sh
### BEGIN INIT INFO
# Provides:	manageserver
# Required-Start:	$remote_fs $syslog $network $all
# Required-Stop:	$remote_fs $syslog
# Default-Start:	2 3 4 5
# Default-Stop:	0 1 6
### END INIT INFO

/usr/bin/java -jar /manage/management-server.jar > runtime.log &

and make that run at startup with:

update-rc.d manageserver defaults

For more on LSB start scripts, visit https://wiki.debian.org/LSBInitScripts

Plans:

Additional planned features are:

  • database system to store the data collected
  • graph display of events?
  • select server subnet(s)
  • add specific server IPs not in the subnets
  • manage battery backups and tell servers when exactly to power down in the event of an outage
  • add some more master key configurations for fallback mechanisms
  • add server specific key functions and configurations (such as owner information, contact details, and others)
  • battery back the pcDuino so that it sustains itself independent of any UPS and general power failures long enough to outlast all of the other computers in operation, and so that it completely isolates itself from the main power supply within a second of a power anomaly until it determines the power to be at a safe voltage. The nice thing about an SBC is that it doesn't need a complex electrical system: so long as there's 4.85 to 5.25 volts at approximately 200mA to keep it going, the (dedicated) onboard power regulator handles the rest, since it's designed for battery-style installations.
  • make it independent of any server so that it operates even when Talos is down. So long as there's a gateway and a network, this thing had better be running.