23 April 2013

How to emulate the loss of an external disk in Linux

One of the certification tests for Oracle clusters is to observe how the system behaves when one of the voting disks is lost.

These disks are normally presented to the servers from external storage arrays and usually have several paths (depending on the HBA cards available in the system).

To emulate a disk failure, we simply have to change the state of each underlying SCSI block device to offline through sysfs, so that the kernel rejects any I/O sent to it.

Let's look at an example:

First, we find our target disk. In a Red Hat environment we would run the command "multipath -ll":

CRS1 (20017380050d10023) dm-2 IBM,2810XIV
size=16G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 7:0:0:1 sdb 8:16  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:1 sdf 8:80  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 7:0:1:1 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=1 status=active
  `- 8:0:1:1 sdn 8:208 active ready running

We can see that the different paths correspond to the block devices "sdb", "sdf", "sdj" and "sdn".
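
If the disk has many paths, the device names can be pulled straight out of the "multipath -ll" output. A minimal sketch, assuming GNU grep and the output format shown above:

# multipath -ll CRS1 | grep -Eo 'sd[a-z]+'
sdb
sdf
sdj
sdn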

Now we have to set each of the paths offline; to do so, we would run the following for each device (only the last one is shown):

# echo offline > /sys/class/block/sdn/device/state
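
To take all the paths down in one go, we can loop over the device names (a minimal sketch, hard-coding the four devices from the example above):

# for dev in sdb sdf sdj sdn; do echo offline > /sys/class/block/$dev/device/state; done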

If we cat the "state" file of each device, we can see that they are all "offline".
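
For example, checking all of them with one command (assuming a bash shell, where the braces expand to the four device names):

# cat /sys/class/block/sd{b,f,j,n}/device/state
offline
offline
offline
offline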

In "messages" we would see entries like these:

Apr 23 15:54:00 server kernel: sd 7:0:0:1: rejecting I/O to offline device
Apr 23 15:54:00 server kernel: device-mapper: multipath: Failing path 8:16.
Apr 23 15:54:00 server multipathd: 8:16: mark as failed
Apr 23 15:54:00 server multipathd: CRS1: remaining active paths: 3
Apr 23 15:54:10 server kernel: sd 8:0:0:1: rejecting I/O to offline device
Apr 23 15:54:10 server kernel: device-mapper: multipath: Failing path 8:80.
Apr 23 15:54:10 server multipathd: 8:80: mark as failed
Apr 23 15:54:10 server multipathd: CRS1: remaining active paths: 2
Apr 23 15:54:18 server kernel: sd 7:0:1:1: rejecting I/O to offline device
Apr 23 15:54:18 server kernel: device-mapper: multipath: Failing path 8:144.
Apr 23 15:54:18 server multipathd: 8:144: mark as failed
Apr 23 15:54:18 server multipathd: CRS1: remaining active paths: 1
Apr 23 15:54:32 server kernel: sd 8:0:1:1: rejecting I/O to offline device
Apr 23 15:54:32 server kernel: device-mapper: multipath: Failing path 8:208.
Apr 23 15:54:32 server kernel: end_request: I/O error, dev dm-2, sector 524368
Apr 23 15:54:32 server multipathd: 8:208: mark as failed
Apr 23 15:54:32 server multipathd: CRS1: remaining active paths: 0
Apr 23 15:54:33 server kernel: end_request: I/O error, dev dm-2, sector 0
Apr 23 15:54:33 server kernel: end_request: I/O error, dev dm-2, sector 4151

To return them to the normal state, we would run this on each device:

# echo running > /sys/class/block/sdn/device/state
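
Or, again, looping over all the paths in one go (same hard-coded device names as before):

# for dev in sdb sdf sdj sdn; do echo running > /sys/class/block/$dev/device/state; done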

And in "messages" we would see the recovery:

Apr 23 16:05:01 server multipathd: CRS1: sdn - directio checker reports path is up
Apr 23 16:05:01 server multipathd: 8:208: reinstated
Apr 23 16:05:01 server multipathd: CRS1: remaining active paths: 1
Apr 23 16:05:14 server multipathd: CRS1: sdj - directio checker reports path is up
Apr 23 16:05:14 server multipathd: 8:144: reinstated
Apr 23 16:05:14 server multipathd: CRS1: remaining active paths: 2
Apr 23 16:05:25 server multipathd: CRS1: sdf - directio checker reports path is up
Apr 23 16:05:25 server multipathd: 8:80: reinstated
Apr 23 16:05:25 server multipathd: CRS1: remaining active paths: 3
Apr 23 16:05:40 server multipathd: CRS1: sdb - directio checker reports path is up
Apr 23 16:05:40 server multipathd: 8:16: reinstated
Apr 23 16:05:40 server multipathd: CRS1: remaining active paths: 4
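
Finally, we can confirm the recovery by reading the "state" file of each device, which should report "running" again, and by re-running "multipath -ll", which should show all four paths active:

# cat /sys/class/block/sdn/device/state
running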