Troubleshooting Solaris Volume Manager (Tasks)

This chapter describes how to troubleshoot problems that are related to Solaris Volume Manager. This chapter provides both general troubleshooting guidelines and specific procedures for resolving some known problems.

This chapter includes the following information:

- Troubleshooting Solaris Volume Manager (Task Map)
- Overview of Troubleshooting the System
- Replacing Disks
- Recovering From Disk Movement Problems
- Device ID Discrepancies After Upgrading to the Solaris 10 Release
- Recovering From Boot Problems
- Recovering From State Database Replica Failures
- Recovering From Soft Partition Problems
- Recovering Storage From a Different System
- Recovering From Disk Set Problems
- Performing Mounted Filesystem Backups Using the ufsdump Command
- Performing System Recovery

This chapter describes some Solaris Volume Manager problems and their appropriate solutions. This chapter is not intended to be all-inclusive, but rather to present common scenarios and recovery procedures.

Troubleshooting Solaris Volume Manager (Task Map)

The following task map identifies some procedures that are needed to troubleshoot Solaris Volume Manager.

Task: Replace a failed disk
Description: Replace a disk, then update state database replicas and logical volumes on the new disk.
For Instructions: How to Replace a Failed Disk

Task: Recover from disk movement problems
Description: Restore disks to original locations or contact product support.
For Instructions: Recovering From Disk Movement Problems

Task: Recover from improper /etc/vfstab entries
Description: Use the fsck command on the mirror, then edit the /etc/vfstab file so that the system boots correctly.
For Instructions: How to Recover From Improper /etc/vfstab Entries

Task: Recover from a boot device failure
Description: Boot from a different submirror.
For Instructions: How to Recover From a Boot Device Failure

Task: Recover from insufficient state database replicas
Description: Delete unavailable replicas by using the metadb command.
For Instructions: How to Recover From Insufficient State Database Replicas

Task: Recover configuration data for a lost soft partition
Description: Use the metarecover command to recover configuration data for a soft partition.
For Instructions: How to Recover Configuration Data for a Soft Partition

Task: Recover a Solaris Volume Manager configuration from salvaged disks
Description: Attach disks to a new system and have Solaris Volume Manager rebuild the configuration from the existing state database replicas.
For Instructions: How to Recover Storage From a Local Disk Set

Task: Recover storage from a different system
Description: Import storage from known disk sets to a different system.
For Instructions: Recovering Storage From a Different System

Task: Purge an inaccessible disk set
Description: Use the metaset command to purge knowledge of a disk set that you cannot take or use.
For Instructions: Recovering From Disk Set Problems

Task: Recover a system configuration stored on Solaris Volume Manager volumes
Description: Use Solaris OS installation media to recover a system configuration stored on Solaris Volume Manager volumes.
For Instructions: Performing System Recovery

Overview of Troubleshooting the System

Prerequisites for Troubleshooting the System

To troubleshoot storage management problems that are related to Solaris Volume Manager, you need to do the following:

- Have root privilege
- Have a current backup of all data

General Guidelines for Troubleshooting Solaris Volume Manager

You should have the following information on hand when you troubleshoot Solaris Volume Manager problems:

- Output from the metadb command
- Output from the metastat command
- Output from the metastat -p command
- Backup copy of the /etc/vfstab file
- Backup copy of the /etc/lvm/mddb.cf file
- Disk partition information from the prtvtoc command (SPARC systems) or the fdisk command (x86 based systems)
- The Solaris version on your system
- A list of the Solaris patches that have been installed
- A list of the Solaris Volume Manager patches that have been installed

Any time you update your Solaris Volume Manager configuration, or make other storage or operating system-related changes to your system,
generate fresh copies of this configuration information. You could also generate this information automatically with a cron job.

General Troubleshooting Approach

Although no single procedure enables you to evaluate all problems with Solaris Volume Manager, the following process provides one general approach that might help.

1. Gather information about the current configuration.
2. Review the current status indicators, including the output from the metastat and metadb commands. This information should indicate which component is faulty.
3. Check the hardware for obvious points of failure:
   - Is everything connected properly?
   - Was there a recent electrical outage?
   - Have there been equipment changes or additions?

Replacing Disks

This section describes how to replace disks in a Solaris Volume Manager environment.

If you have soft partitions on a failed disk or on volumes that are built on a failed disk, you must put the new disk in the same physical location. Also, use the same cntndn number as the disk being replaced.

How to Replace a Failed Disk

Identify the failed disk to be replaced by examining the /var/adm/messages file and the metastat command output.

Locate any state database replicas that might have been placed on the failed disk. Use the metadb command to find the replicas.

The metadb command might report errors for the state database replicas that are located on the failed disk. In this example, c0t1d0 is the problem device.

    # metadb
           flags        first blk    block count
        a m     u       16           1034         /dev/dsk/c0t0d0s4
        a       u       1050         1034         /dev/dsk/c0t0d0s4
        a       u       2084         1034         /dev/dsk/c0t0d0s4
        W   pc luo      16           1034         /dev/dsk/c0t1d0s4
        W   pc luo      1050         1034         /dev/dsk/c0t1d0s4
        W   pc luo      2084         1034         /dev/dsk/c0t1d0s4

The output shows three state database replicas on slice 4 of each of the local disks, c0t0d0 and c0t1d0. The W in the flags field of the c0t1d0s4 slice indicates that the device has write errors. The three replicas on the c0t0d0s4 slice are still good.
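Reading the flags and counting replicas per slice can be sketched as a small filter. This is a minimal sketch, not part of Solaris Volume Manager: the helper names summarize_replicas and sample_metadb_output are hypothetical, the sample text is copied from the example above, and the rule that an uppercase flag marks a replica in error is taken from the state database recovery procedure later in this chapter.

```shell
#!/bin/sh
# Sample text copied from the metadb example above. On a live system you
# would pipe real `metadb` output into summarize_replicas instead.
sample_metadb_output() {
    cat <<'EOF'
       flags        first blk    block count
    a m     u       16           1034         /dev/dsk/c0t0d0s4
    a       u       1050         1034         /dev/dsk/c0t0d0s4
    a       u       2084         1034         /dev/dsk/c0t0d0s4
    W   pc luo      16           1034         /dev/dsk/c0t1d0s4
    W   pc luo      1050         1034         /dev/dsk/c0t1d0s4
    W   pc luo      2084         1034         /dev/dsk/c0t1d0s4
EOF
}

# summarize_replicas (hypothetical helper): read metadb-style output on
# stdin and print "<slice> <replica count> <ok|ERROR>" for each slice.
summarize_replicas() {
    LC_ALL=C awk '
        $NF ~ /^\/dev\// {
            slice = $NF
            count[slice]++
            # Everything before the last three fields is the flags column.
            flags = ""
            for (i = 1; i <= NF - 3; i++) flags = flags $i
            # An uppercase flag marks a replica that is in error.
            if (flags ~ /[A-Z]/) bad[slice] = 1
        }
        END {
            for (s in count)
                printf "%s %d %s\n", s, count[s], ((s in bad) ? "ERROR" : "ok")
        }
    ' | sort
}

sample_metadb_output | summarize_replicas
```

The per-slice count is the number you record before deleting replicas, so that the same number can be added back after the disk is replaced.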

Record the slice name where the state database replicas reside and the number of state database replicas. Then, delete the state database replicas.

The number of state database replicas is obtained by counting the number of appearances of a slice in the metadb command output. In this example, the three state database replicas that exist on c0t1d0s4 are deleted.

    # metadb -d c0t1d0s4

If, after deleting the bad state database replicas, you are left with three or fewer, add more state database replicas before continuing. Doing so helps to ensure that configuration information remains intact.

Locate and delete any hot spares on the failed disk.

Use the metastat command to find hot spares. In this example, hot spare pool hsp000 included c0t1d0s6, which is then deleted from the pool.

    # metahs -d hsp000 c0t1d0s6
    hsp000: Hotspare is deleted

Replace the failed disk.

This step might entail using the cfgadm command, the luxadm command, or other commands as appropriate for your hardware and environment. When performing this step, make sure to follow your hardware's documented procedures to properly manipulate the Solaris state of this disk.

Repartition the new disk.

Use the format command or the fmthard command to partition the disk with the same slice information as the failed disk. If you have the prtvtoc output from the failed disk, you can format the replacement disk with the fmthard -s /tmp/failed-disk-prtvtoc-output command.

If you deleted state database replicas, add the same number back to the appropriate slice.

In this example, /dev/dsk/c0t1d0s4 is used.

    # metadb -a -c 3 c0t1d0s4

If any slices on the disk are components of RAID-5 volumes or are components of RAID-0 volumes that are in turn submirrors of RAID-1 volumes, run the metareplace -e command for each slice.
In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.

    # metareplace -e d10 c0t1d0s4

If any soft partitions are built directly on slices on the replaced disk, run the metarecover -m -p command on each slice that contains soft partitions. This command regenerates the extent headers on disk.

In this example, /dev/dsk/c0t1d0s4 needs to have the soft partition markings on disk regenerated. The slice is scanned and the markings are reapplied, based on the information in the state database replicas.

    # metarecover c0t1d0s4 -m -p

If any soft partitions on the disk are components of RAID-5 volumes or are components of RAID-0 volumes that are submirrors of RAID-1 volumes, run the metareplace -e command for each slice.

In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.

    # metareplace -e d10 c0t1d0s4

If any RAID-0 volumes have soft partitions built on them, run the metarecover command for each RAID-0 volume.

In this example, RAID-0 volume d17 has soft partitions built on it.

    # metarecover d17 -m -p

Replace hot spares that were deleted, and add them to the appropriate hot spare pool or pools.

In this example, hot spare pool hsp000 included c0t1d0s6. This slice is added to the hot spare pool.

    # metahs -a hsp000 c0t1d0s6
    hsp000: Hotspare is added

If soft partitions or nonredundant volumes were affected by the failure, restore data from backups. If only redundant volumes were affected, then validate your data.

Check the user and application data on all volumes. You might have to run an application-level consistency checker, or use some other method to check the data.

Recovering From Disk Movement Problems

This section describes how to recover from unexpected problems after moving disks in the Solaris Volume Manager environment.

Disk Movement and Device ID Overview

Solaris Volume Manager uses device IDs, which are associated with a specific disk, to track all disks that are used in a Solaris Volume Manager configuration.
When disks are moved to a different controller or when the SCSI target numbers change, Solaris Volume Manager usually correctly identifies the movement and updates all related Solaris Volume Manager records accordingly. No system administrator intervention is required. In isolated cases, Solaris Volume Manager cannot completely update the records and reports an error on boot.

Resolving Unnamed Devices Error Message

If you add new hardware or move hardware (for example, you move a string of disks from one controller to another controller), Solaris Volume Manager checks the device IDs that are associated with the disks that moved, and updates the cntndn names in internal Solaris Volume Manager records accordingly. If the records cannot be updated, the boot processes that are spawned by the svc:/system/mdmonitor service report an error to the console at boot time:

    Unable to resolve unnamed devices for volume management.
    Please refer to the Solaris Volume Manager documentation,
    Troubleshooting section, at http://docs.sun.com or from
    your local copy.

No data loss has occurred, and none will occur as a direct result of this problem. This error message indicates that the Solaris Volume Manager name records have been only partially updated. Output from the metastat command shows some of the cntndn names that were previously used. The output also shows some of the cntndn names that reflect the state after the move.

If you need to update your Solaris Volume Manager configuration while this condition exists, you must use the cntndn names that are reported by the metastat command when you issue any meta* commands.

If this error condition occurs, you can do one of the following to resolve the condition:

Restore all disks to their original locations. Next, do a reconfiguration reboot, or run (as a single command):

    /usr/sbin/devfsadm && /usr/sbin/metadevadm -r

After these commands complete, the error condition is resolved.

Contact your support representative for guidance.

This error condition is quite unlikely to occur. If it does occur, it is most likely to affect Fibre Channel-attached storage.

Device ID Discrepancies After Upgrading to the Solaris 10 Release

Beginning with the Solaris 10 release, device ID output is displayed in a new format. Solaris Volume Manager may display the device ID output in a new or old format depending on when the device ID information was added to the state database replica.

Previously, the device ID was displayed as a hexadecimal value. The new format displays the device ID as an ASCII string. In many cases, the change is negligible, as in the following example:

    old format: id1,ssd@w600c0ff00000000007ecd255a9336d00
    new format: id1,ssd@n600c0ff00000000007ecd255a9336d00

In other cases, the change is more noticeable, as in the following example:

    old format: id1,sd@w4849544143484920444b3332454a2d33364e4320202020203433334239383939
    new format: id1,ssd@n600c0ff00000000007ecd255a9336d00

When you upgrade to the Solaris 10 release, the format of the device IDs that are associated with existing disk sets that were created in a previous Solaris release is not updated in the Solaris Volume Manager configuration. If you need to revert to a previous Solaris release, configuration changes made to disk sets after the upgrade might not be available to that release. These configuration changes include:

- Adding a new disk to a disk set that existed before the upgrade
- Creating a new disk set
- Creating state database replicas

These configuration changes can affect all disk sets that you are able to create in Solaris Volume Manager, including the local set. For example, if you implement any of these changes to a disk set created in the Solaris 10 release, you cannot import the disk set to a previous Solaris release. As another example, you might upgrade one side of a mirrored root to the Solaris 10 release and then make configuration changes to the local set.
These changes would not be recognized if you then incorporated the submirror back into the previous Solaris release.

The Solaris 10 OS configuration always displays the new format of the device ID, even in the case of an upgrade. You can display this information using the prtconf command. Conversely, Solaris Volume Manager displays either the old or the new format. Which format is displayed in Solaris Volume Manager depends on which version of the Solaris OS you were running when you began using the disk. To determine if Solaris Volume Manager is displaying a different, but equivalent, form of the device ID from that of the Solaris OS configuration, compare the output from the metastat command with the output from the prtconf command.

In the following example, the metastat command output displays a different, but equivalent, form of the device ID for c1t6d0 from the prtconf command output for the same disk.

    # metastat
    d127: Concat/Stripe
        Size: 17629184 blocks (8.4 GB)
        Stripe 0:
            Device      Start Block  Dbase   Reloc
            c1t6d0s2    32768        Yes     Yes

    Device Relocation Information:
    Device   Reloc  Device ID
    c1t6d0   Yes    id1,sd@w4849544143484920444b3332454a2d33364e4320202020203433334239383939

    # prtconf -v
    .(output truncated)
    .
    .
    sd, instance #6
        System properties:
            name='lun' type=int items=1
                value=00000000
            name='target' type=int items=1
                value=00000006
            name='class' type=string items=1
                value='scsi'
        Driver properties:
            name='pm-components' type=string items=3 dev=none
                value='NAME=spindle-motor' + '0=off' + '1=on'
            name='pm-hardware-state' type=string items=1 dev=none
                value='needs-suspend-resume'
            name='ddi-failfast-supported' type=boolean dev=none
            name='ddi-kernel-ioctl' type=boolean dev=none
        Hardware properties:
            name='devid' type=string items=1
                value='id1,@THITACHI_DK32EJ-36NC_____433B9899'
    .
    .
    .
    (output truncated)

The line containing "instance #6" in the output from the prtconf command correlates to the disk c1t6d0 in the output from the metastat command.
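The equivalence of the two device ID forms in this output can be checked by hand. The following is a minimal sketch, assuming that the hexadecimal digits after the w in the metastat form are plain ASCII byte values and that prtconf shows spaces as underscores. The helper name hex_to_ascii is hypothetical, and the leading type characters (w in one form, @T in the other) are not decoded here.

```shell
#!/bin/sh
# hex_to_ascii (hypothetical helper): decode a string of hexadecimal byte
# pairs into the ASCII characters that they represent.
hex_to_ascii() {
    hex=$1
    out=""
    # Refuse input that is not made up of whole byte pairs.
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        rest=${hex#??}          # everything after the first two digits
        pair=${hex%"$rest"}     # the first two digits
        hex=$rest
        # Convert the pair to octal, then let printf emit that character.
        out="$out$(printf "\\$(printf '%03o' "0x$pair")")"
    done
    printf '%s\n' "$out"
}

# Hexadecimal part of the device ID that metastat reports for c1t6d0.
hex_to_ascii 4849544143484920444b3332454a2d33364e4320202020203433334239383939 | tr ' ' '_'
```

Under these assumptions, the decoded string can be compared with the text that follows @T in the devid value from the prtconf output.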
The device ID, id1,@THITACHI_DK32EJ-36NC_____433B9899, in the output from the prtconf command correlates to the device ID, id1,sd@w4849544143484920444b3332454a2d33364e4320202020203433334239383939, in the output from the metastat command. This difference in output indicates that Solaris Volume Manager is displaying the hexadecimal form of the device ID in the output from the metastat command, while the Solaris 10 OS configuration is displaying an ASCII string in the output from the prtconf command.

Recovering From Boot Problems

Because Solaris Volume Manager enables you to mirror the root (/), swap, and /usr directories, special problems can arise when you boot the system. These problems can arise either through hardware failures or operator error. The procedures in this section provide solutions to such potential problems.

The following table describes these problems and points you to the appropriate solution.

Common Boot Problems With Solaris Volume Manager

Reason for the Boot Problem: The /etc/vfstab file contains incorrect information.
For Instructions: How to Recover From Improper /etc/vfstab Entries

Reason for the Boot Problem: Not enough state database replicas have been defined.
For Instructions: How to Recover From Insufficient State Database Replicas

Reason for the Boot Problem: A boot device (disk) has failed.
For Instructions: How to Recover From a Boot Device Failure

Background Information for Boot Problems

- If Solaris Volume Manager takes a volume offline due to errors, unmount all file systems on the disk where the failure occurred. Because each disk slice is independent, multiple file systems can be mounted on a single disk. If the software has encountered a failure, other slices on the same disk will likely experience failures soon. File systems that are mounted directly on disk slices do not have the protection of Solaris Volume Manager error handling. Leaving such file systems mounted can leave you vulnerable to crashing the system and losing data.
- Minimize the amount of time you run with submirrors that are disabled or offline. During resynchronization and online backup intervals, the full protection of mirroring is gone.

How to Recover From Improper /etc/vfstab Entries

If you have made an incorrect entry in the /etc/vfstab file, for example, when mirroring the root (/) file system, the system appears at first to be booting properly. Then, the system fails. To remedy this situation, you need to edit the /etc/vfstab file while in single-user mode.

The high-level steps to recover from improper /etc/vfstab file entries are as follows:

- Booting the system to single-user mode
- Running the fsck command on the mirror volume
- Remounting the file system with read-write options enabled
- Optional: running the metaroot command for a root (/) mirror
- Verifying that the /etc/vfstab file correctly references the volume for the file system entry
- Rebooting the system

Recovering the root (/) RAID-1 (Mirror) Volume

In the following example, the root (/) file system is mirrored with a two-way mirror, d0. The root (/) entry in the /etc/vfstab file has somehow reverted to the original slice of the file system. However, the information in the /etc/system file still shows booting to be from the mirror d0.
The most likely reason is that the metaroot command was not used to maintain the /etc/system and /etc/vfstab files. Another possible reason is that an old copy of the /etc/vfstab file was copied back into the current /etc/vfstab file.

The incorrect /etc/vfstab file looks similar to the following:

    #device            device              mount   FS     fsck  mount    mount
    #to mount          to fsck             point   type   pass  at boot  options
    #
    /dev/dsk/c0t3d0s0  /dev/rdsk/c0t3d0s0  /       ufs    1     no       -
    /dev/dsk/c0t3d0s1  -                   -       swap   -     no       -
    /dev/dsk/c0t3d0s6  /dev/rdsk/c0t3d0s6  /usr    ufs    2     no       -
    #
    /proc              -                   /proc   proc   -     no       -
    swap               -                   /tmp    tmpfs  -     yes      -

Because of the errors, you automatically go into single-user mode when the system is booted:

    ok boot
    ...
    configuring network interfaces: hme0.
    Hostname: host1
    mount: /dev/dsk/c0t3d0s0 is not this fstype.
    setmnt: Cannot open /etc/mnttab for writing
    INIT: Cannot create /var/adm/utmp or /var/adm/utmpx
    INIT: failed write of utmpx entry:" "
    INIT: failed write of utmpx entry:" "
    INIT: SINGLE USER MODE
    Type Ctrl-d to proceed with normal startup,
    (or give root password for system maintenance): <root-password>

At this point, the root (/) and /usr file systems are mounted read-only. Follow these steps:

Run the fsck command on the root (/) mirror.

Be careful to use the correct volume for the root (/) mirror.

    # fsck /dev/md/rdsk/d0
    ** /dev/md/rdsk/d0
    ** Currently Mounted on /
    ** Phase 1 - Check Blocks and Sizes
    ** Phase 2 - Check Pathnames
    ** Phase 3 - Check Connectivity
    ** Phase 4 - Check Reference Counts
    ** Phase 5 - Check Cyl groups
    2274 files, 11815 used, 10302 free (158 frags, 1268 blocks, 0.7% fragmentation)

Remount the root (/) file system as a read/write file system so that you can edit the /etc/vfstab file.

    # mount -o rw,remount /dev/md/dsk/d0 /
    mount: warning: cannot lock temp file </etc/.mnt.lock>

Run the metaroot command.

    # metaroot d0

This command edits the /etc/system and /etc/vfstab files to specify that the root (/) file system is now on volume d0.
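Before rebooting, it can help to confirm mechanically that the root (/) entry now points at a volume and not at a raw slice. The following is a minimal sketch: the helper names root_device_in_vfstab and root_on_volume are hypothetical, and the demonstration runs against a scratch copy of the incorrect entries from this example, not against the live /etc/vfstab file.

```shell
#!/bin/sh
# root_device_in_vfstab (hypothetical helper): print the "device to mount"
# field of the root (/) entry in a vfstab-format file.
root_device_in_vfstab() {
    awk '$1 !~ /^#/ && $3 == "/" { print $1; exit }' "$1"
}

# root_on_volume (hypothetical helper): succeed only when that device is a
# Solaris Volume Manager volume under /dev/md/dsk/.
root_on_volume() {
    case "$(root_device_in_vfstab "$1")" in
        /dev/md/dsk/*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Scratch file holding the incorrect entries shown in this example.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#device            device              mount   FS     fsck  mount    mount
#to mount          to fsck             point   type   pass  at boot  options
/dev/dsk/c0t3d0s0  /dev/rdsk/c0t3d0s0  /       ufs    1     no       -
/dev/dsk/c0t3d0s6  /dev/rdsk/c0t3d0s6  /usr    ufs    2     no       -
EOF

if root_on_volume "$tmp"; then
    echo "root (/) entry references a volume"
else
    echo "root (/) entry references $(root_device_in_vfstab "$tmp"), not a volume"
fi
rm -f "$tmp"
```

On a repaired system, the same check against /etc/vfstab should take the first branch, because the root (/) entry then names /dev/md/dsk/d0.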

Verify that the /etc/vfstab file contains the correct volume entries.

The root (/) entry in the /etc/vfstab file should appear as follows so that the entry for the file system correctly references the RAID-1 volume:

    #device            device              mount   FS     fsck  mount    mount
    #to mount          to fsck             point   type   pass  at boot  options
    #
    /dev/md/dsk/d0     /dev/md/rdsk/d0     /       ufs    1     no       -
    /dev/dsk/c0t3d0s1  -                   -       swap   -     no       -
    /dev/dsk/c0t3d0s6  /dev/rdsk/c0t3d0s6  /usr    ufs    2     no       -
    #
    /proc              -                   /proc   proc   -     no       -
    swap               -                   /tmp    tmpfs  -     yes      -

Reboot the system.

The system returns to normal operation.

How to Recover From a Boot Device Failure

If you have a root (/) mirror and your boot device fails, you need to set up an alternate boot device.

The high-level steps in this task are as follows:

- Booting from the alternate root (/) submirror
- Determining the erred state database replicas and volumes
- Repairing the failed disk
- Restoring state database replicas and volumes to their original state

Initially, when the boot device fails, you'll see a message similar to the following. This message might differ among various architectures.

    Rebooting with command:
    Boot device: /iommu/sbus/dma@f,81000/esp@f,80000/sd@3,0
    The selected SCSI device is not responding
    Can't open boot device
    ...

When you see this message, note the device. Then, follow these steps:

Boot from another root (/) submirror.

Since only two of the six state database replicas in this example are in error, you can still boot. If this were not the case, you would need to delete the inaccessible state database replicas in single-user mode. This procedure is described in How to Recover From Insufficient State Database Replicas.

When you created the mirror for the root (/) file system, you should have recorded the alternate boot device as part of that procedure. In this example, disk2 is that alternate boot device.

    ok boot disk2
    SunOS Release 5.9 Version s81_51 64-bit
    Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.
    Hostname: demo
    ...
    demo console login: root
    Password: <root-password>
    Dec 16 12:22:09 host1 login: ROOT LOGIN /dev/console
    Last login: Wed Dec 12 10:55:16 on console
    Sun Microsystems Inc. SunOS 5.9 s81_51 May 2002
    ...

Determine how many state database replicas have failed by using the metadb command.

    # metadb
           flags        first blk    block count
        M     p         unknown      unknown      /dev/dsk/c0t3d0s3
        M     p         unknown      unknown      /dev/dsk/c0t3d0s3
        a m  p  luo     16           1034         /dev/dsk/c0t2d0s3
        a    p  luo     1050         1034         /dev/dsk/c0t2d0s3
        a    p  luo     16           1034         /dev/dsk/c0t1d0s3
        a    p  luo     1050         1034         /dev/dsk/c0t1d0s3

In this example, the system can no longer detect state database replicas on slice /dev/dsk/c0t3d0s3, which is part of the failed disk.

Determine that half of the root (/), swap, and /usr mirrors have failed by using the metastat command.

    # metastat
    d0: Mirror
        Submirror 0: d10
          State: Needs maintenance
        Submirror 1: d20
          State: Okay
    ...

    d10: Submirror of d0
        State: Needs maintenance
        Invoke: "metareplace d0 /dev/dsk/c0t3d0s0 <new device>"
        Size: 47628 blocks
        Stripe 0:
            Device              Start Block  Dbase  State        Hot Spare
            /dev/dsk/c0t3d0s0   0            No     Maintenance

    d20: Submirror of d0
        State: Okay
        Size: 47628 blocks
        Stripe 0:
            Device              Start Block  Dbase  State        Hot Spare
            /dev/dsk/c0t2d0s0   0            No     Okay

    d1: Mirror
        Submirror 0: d11
          State: Needs maintenance
        Submirror 1: d21
          State: Okay
    ...

    d11: Submirror of d1
        State: Needs maintenance
        Invoke: "metareplace d1 /dev/dsk/c0t3d0s1 <new device>"
        Size: 69660 blocks
        Stripe 0:
            Device              Start Block  Dbase  State        Hot Spare
            /dev/dsk/c0t3d0s1   0            No     Maintenance

    d21: Submirror of d1
        State: Okay
        Size: 69660 blocks
        Stripe 0:
            Device              Start Block  Dbase  State        Hot Spare
            /dev/dsk/c0t2d0s1   0            No     Okay

    d2: Mirror
        Submirror 0: d12
          State: Needs maintenance
        Submirror 1: d22
          State: Okay
    ...

    d12: Submirror of d2
        State: Needs maintenance
        Invoke: "metareplace d2 /dev/dsk/c0t3d0s6 <new device>"
        Size: 286740 blocks
        Stripe 0:
            Device              Start Block  Dbase  State        Hot Spare
            /dev/dsk/c0t3d0s6   0            No     Maintenance

    d22: Submirror of d2
        State: Okay
        Size: 286740 blocks
        Stripe 0:
            Device              Start Block  Dbase  State        Hot Spare
            /dev/dsk/c0t2d0s6   0            No     Okay

In this example, the metastat command shows that the following submirrors need maintenance:

- Submirror d10, device c0t3d0s0
- Submirror d11, device c0t3d0s1
- Submirror d12, device c0t3d0s6

Halt the system and replace the disk. Use the format command or the fmthard command to partition the disk as it was before the failure.

If the new disk is identical to the existing disk (the intact side of the mirror, in this example), you can quickly format the new disk. To do so, use the prtvtoc /dev/rdsk/c0t2d0s2 | fmthard -s - /dev/rdsk/c0t3d0s2 command (c0t3d0, in this example).

    # halt
    ...
    Halted
    ...
    ok boot
    ...
    # format /dev/rdsk/c0t3d0s0

Reboot the system.

Note that you must reboot from the other half of the root (/) mirror. You should have recorded the alternate boot device when you created the mirror.

    # halt
    ...
    ok boot disk2

To delete the failed state database replicas and then add them back, use the metadb command.

    # metadb
           flags        first blk    block count
        M     p         unknown      unknown      /dev/dsk/c0t3d0s3
        M     p         unknown      unknown      /dev/dsk/c0t3d0s3
        a m  p  luo     16           1034         /dev/dsk/c0t2d0s3
        a    p  luo     1050         1034         /dev/dsk/c0t2d0s3
        a    p  luo     16           1034         /dev/dsk/c0t1d0s3
        a    p  luo     1050         1034         /dev/dsk/c0t1d0s3
    # metadb -d c0t3d0s3
    # metadb -c 2 -a c0t3d0s3
    # metadb
           flags        first blk    block count
        a m  p  luo     16           1034         /dev/dsk/c0t2d0s3
        a    p  luo     1050         1034         /dev/dsk/c0t2d0s3
        a    p  luo     16           1034         /dev/dsk/c0t1d0s3
        a    p  luo     1050         1034         /dev/dsk/c0t1d0s3
        a       u       16           1034         /dev/dsk/c0t3d0s3
        a       u       1050         1034         /dev/dsk/c0t3d0s3

Re-enable the submirrors by using the metareplace command.

    # metareplace -e d0 c0t3d0s0
    Device /dev/dsk/c0t3d0s0 is enabled
    # metareplace -e d1 c0t3d0s1
    Device /dev/dsk/c0t3d0s1 is enabled
    # metareplace -e d2 c0t3d0s6
    Device /dev/dsk/c0t3d0s6 is enabled

After some time, the resynchronization will complete. You can now return to booting from the original device.

Recovering From State Database Replica Failures

If the state database replica quorum is not met, for example, due to a drive failure, the system cannot be rebooted into multiuser mode. This situation could follow a panic when Solaris Volume Manager discovers that fewer than half of the state database replicas are available. This situation could also occur if the system is rebooted with exactly half or fewer functional state database replicas. In Solaris Volume Manager terminology, the state database has gone "stale." This procedure explains how to recover from this problem.

How to Recover From Insufficient State Database Replicas

Boot the system.

Determine which state database replicas are unavailable.

    # metadb -i

If one or more disks are known to be unavailable, delete the state database replicas on those disks.
Otherwise, delete enough erred state database replicas (W, M, D, F, or R status flags reported by metadb) to ensure that a majority of the existing state database replicas are not erred.

    # metadb -d disk-slice

State database replicas with a capitalized status flag are in error. State database replicas with a lowercase status flag are functioning normally.

Verify that the replicas have been deleted.

    # metadb

Reboot the system.

    # reboot

If necessary, replace the disk, format it appropriately, then add any state database replicas that are needed to the disk.

Follow the instructions in Creating State Database Replicas.

Once you have a replacement disk, halt the system, replace the failed disk, and once again, reboot the system. Use the format command or the fmthard command to partition the disk as it was configured before the failure.

Recovering From Stale State Database Replicas

In the following example, a disk that contains seven replicas has gone bad. As a result, the system has only three good replicas.
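
The thresholds stated in this section can be written out as arithmetic. The following is a minimal sketch of that rule only, not of how Solaris Volume Manager evaluates the quorum internally. The helper name replica_quorum_state and its three result labels are hypothetical.

```shell
#!/bin/sh
# replica_quorum_state (hypothetical helper): classify a replica count by
# using the thresholds that this section states.
#   $1 = number of good replicas, $2 = total number of replicas
replica_quorum_state() {
    good=$1
    total=$2
    if [ $((good * 2)) -lt "$total" ]; then
        echo "panic"        # fewer than half of the replicas are available
    elif [ $((good * 2)) -eq "$total" ]; then
        echo "no-reboot"    # exactly half: a reboot would not reach multiuser mode
    else
        echo "ok"           # a majority of the replicas are available
    fi
}

# The example in this section: seven of ten replicas are lost, three remain.
replica_quorum_state 3 10
```

Deleting the seven failed replicas changes the total to three, all good, which is why the system can be rebooted after the metadb -d command in this example.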
The system panics, then cannot reboot into multiuser mode.

    panic[cpu0]/thread=70a41e00: md: state database problem

    403238a8 md:mddb_commitrec_wrapper+6c (2, 1, 70a66ca0, 40323964, 70a66ca0, 3c)
      %l0-7: 0000000a 00000000 00000001 70bbcce0 70bbcd04 70995400 00000002 00000000
    40323908 md:alloc_entry+c4 (70b00844, 1, 9, 0, 403239e4, ff00)
      %l0-7: 70b796a4 00000001 00000000 705064cc 70a66ca0 00000002 00000024 00000000
    40323968 md:md_setdevname+2d4 (7003b988, 6, 0, 63, 70a71618, 10)
      %l0-7: 70a71620 00000000 705064cc 70b00844 00000010 00000000 00000000 00000000
    403239f8 md:setnm_ioctl+134 (7003b968, 100003, 64, 0, 0, ffbffc00)
      %l0-7: 7003b988 00000000 70a71618 00000000 00000000 000225f0 00000000 00000000
    40323a58 md:md_base_ioctl+9b4 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, ff1b5470)
      %l0-7: ff3f2208 ff3f2138 ff3f26a0 00000000 00000000 00000064 ff1396e9 00000000
    40323ad0 md:md_admin_ioctl+24 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, 0)
      %l0-7: 00005605 ffbffa3c 00100003 0157ffff 0aa64245 00000000 7efefeff 81010100
    40323b48 md:mdioctl+e4 (157ffff, 5605, ffbffa3c, 100003, 7016db60, 40323c7c)
      %l0-7: 0157ffff 00005605 ffbffa3c 00100003 0003ffff 70995598 70995570 0147c800
    40323bb0 genunix:ioctl+1dc (3, 5605, ffbffa3c, fffffff8, ffffffe0, ffbffa65)
      %l0-7: 0114c57c 70937428 ff3f26a0 00000000 00000001 ff3b10d4 0aa64245 00000000

    panic: stopped at edd000d8: ta %icc,%g0 + 125
    Type 'go' to resume

    ok boot -s
    Resetting ...

    Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz), No Keyboard
    OpenBoot 3.11, 128 MB memory installed, Serial #9841776.
    Ethernet address 8:0:20:96:2c:70, Host ID: 80962c70.

    Rebooting with command: boot -s
    Boot device: /pci@1f,0/pci@1,1/ide@3/disk@0,0:a File and args: -s
    SunOS Release 5.9 Version s81_39 64-bit

    Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.
    configuring IPv4 interfaces: hme0.
    Hostname: dodo

    metainit: dodo: stale databases

    Insufficient metadevice database replicas located.

    Use metadb to delete databases which are broken.
    Ignore any "Read-only file system" error messages.
    Reboot the system when finished to reload the metadevice database.
    After reboot, repair any broken database replicas which were deleted.

    Type control-d to proceed with normal startup,
    (or give root password for system maintenance): root-password
    single-user privilege assigned to /dev/console.
    Entering System Maintenance Mode

    Jun 7 08:57:25 su: 'su root' succeeded for root on /dev/console
    Sun Microsystems Inc. SunOS 5.9 s81_39 May 2002
    # metadb -i
           flags        first blk    block count
        a m  p  lu      16           8192         /dev/dsk/c0t0d0s7
        a    p  l       8208         8192         /dev/dsk/c0t0d0s7
        a    p  l       16400        8192         /dev/dsk/c0t0d0s7
        M     p         16           unknown      /dev/dsk/c1t1d0s0
        M     p         8208         unknown      /dev/dsk/c1t1d0s0
        M     p         16400        unknown      /dev/dsk/c1t1d0s0
        M     p         24592        unknown      /dev/dsk/c1t1d0s0
        M     p         32784        unknown      /dev/dsk/c1t1d0s0
        M     p         40976        unknown      /dev/dsk/c1t1d0s0
        M     p         49168        unknown      /dev/dsk/c1t1d0s0
    # metadb -d c1t1d0s0
    # metadb
           flags        first blk    block count
        a m  p  lu      16           8192         /dev/dsk/c0t0d0s7
        a    p  l       8208         8192         /dev/dsk/c0t0d0s7
        a    p  l       16400        8192         /dev/dsk/c0t0d0s7
    #

The system panicked because it could no longer detect state database replicas on slice /dev/dsk/c1t1d0s0. This slice is part of the failed disk or is attached to a failed controller. The first metadb command identifies the replicas on this slice as having a problem with the master blocks.

When you delete the stale state database replicas, the root (/) file system is read-only. You can ignore the mddb.cf error messages that are displayed.

At this point, the system is again functional, although it probably has fewer state database replicas than it should. Any volumes that used part of the failed storage are also either failed, erred, or hot-spared. Those issues should be addressed promptly.

Recovering From Soft Partition Problems

This section shows how to recover configuration information for soft partitions.
You should only use the following procedure if all of your state database replicas have been lost and you do not have one of the following:

- A current or accurate copy of metastat -p output
- A current or accurate copy of the md.cf file
- An up-to-date md.tab file

How to Recover Configuration Data for a Soft Partition

At the beginning of each soft partition extent, a sector is used to mark the beginning of the soft partition extent. These hidden sectors are called extent headers. These headers do not appear to the user of the soft partition. If all Solaris Volume Manager configuration data is lost, the disk can be scanned in an attempt to generate the configuration data.

This procedure is a last option to recover lost soft partition configuration information. The metarecover command should only be used when you have lost both your metadb and md.cf files, and your md.tab file is lost or out of date.

Note: This procedure only works to recover soft partition information. This procedure does not assist in recovering from other lost configurations or for recovering configuration information for other Solaris Volume Manager volumes.

Note: If your configuration included other Solaris Volume Manager volumes that were built on top of soft partitions, you should recover the soft partitions before attempting to recover the other volumes.

Configuration information about your soft partitions is stored on your devices and in your state database. Since either source could be corrupt, you must indicate to the metarecover command which source is reliable.

First, use the metarecover command to determine whether the two sources agree. If they do agree, the metarecover command cannot be used to make any changes. However, if the metarecover command reports an inconsistency, you must examine its output carefully to determine whether the disk or the state database is corrupt. Then, you should use the metarecover command to rebuild the configuration based on the appropriate source.
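The extent headers described at the start of this procedure can be checked against the metastat output in the example that follows, where d10, d11, and d12 start at blocks 1, 10242, and 20483 and each hold 10240 blocks. The sketch below is only an illustration: it assumes the extents are laid out back to back with exactly one header sector before each extent, and the function name next_extent_start is hypothetical, not part of Solaris Volume Manager.

```shell
# Hypothetical helper, for illustration only.
# Given an extent's start block and block count, print where the data of
# the next back-to-back extent would begin, assuming a single header
# sector sits between the two extents.
next_extent_start() {
    start=$1
    count=$2
    echo $((start + count + 1))
}

next_extent_start 1 10240       # d10 -> start of d11
next_extent_start 10242 10240   # d11 -> start of d12
```

With the values from the example, the two calls print 10242 and 20483, which match the start blocks that metastat reports for d11 and d12. Extents on a real disk are not required to be contiguous, so treat this as a consistency check rather than a rule.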
1. Read the Configuration Guidelines for Soft Partitions.

2. Review the soft partition recovery information by using the metarecover command.

    # metarecover component -p -d

    component    Specifies the cntndnsn name of the raw component
    -p           Specifies to regenerate soft partitions
    -d           Specifies to scan the physical slice for extent headers of soft partitions

Recovering Soft Partitions from On-Disk Extent Headers

    # metarecover c1t1d0s1 -p -d
    The following soft partitions were found and will be added to
    your metadevice configuration.
     Name            Size     No. of Extents
        d10           10240         1
        d11           10240         1
        d12           10240         1
    WARNING: You are about to add one or more soft partition
    metadevices to your metadevice configuration. If there
    appears to be an error in the soft partition(s) displayed
    above, do NOT proceed with this recovery operation.
    Are you sure you want to do this (yes/no)?yes
    c1t1d0s1: Soft Partitions recovered from device.
    bash-2.05# metastat
    d10: Soft Partition
        Device: c1t1d0s1
        State: Okay
        Size: 10240 blocks
            Device      Start Block  Dbase  Reloc
            c1t1d0s1    0            No     Yes

            Extent      Start Block  Block count
            0           1            10240

    d11: Soft Partition
        Device: c1t1d0s1
        State: Okay
        Size: 10240 blocks
            Device      Start Block  Dbase  Reloc
            c1t1d0s1    0            No     Yes

            Extent      Start Block  Block count
            0           10242        10240

    d12: Soft Partition
        Device: c1t1d0s1
        State: Okay
        Size: 10240 blocks
            Device      Start Block  Dbase  Reloc
            c1t1d0s1    0            No     Yes

            Extent      Start Block  Block count
            0           20483        10240

In this example, three soft partitions are recovered from disk, after the state database replicas were accidentally deleted.

Recovering Storage From a Different System

You can recover a Solaris Volume Manager configuration, even onto a different system from the original system.
How to Recover Storage From a Local Disk Set

If you experience a system failure, you can attach the storage to a different system and recover the complete configuration from the local disk set. For example, assume you have a system with an external disk pack of six disks in it and a Solaris Volume Manager configuration, including at least one state database replica, on some of those disks. If you have a system failure, you can physically move the disk pack to a new system and enable the new system to recognize the configuration. This procedure describes how to move the disks to another system and recover the configuration from a local disk set.

Note: This recovery procedure works only with Solaris 9, and later, Solaris Volume Manager volumes.

1. Attach the disk or disks that contain the Solaris Volume Manager configuration to a system with no preexisting Solaris Volume Manager configuration.

2. Do a reconfiguration reboot to ensure that the system recognizes the newly added disks.

    # reboot -- -r

3. Determine the major/minor number for a slice containing a state database replica on the newly added disks. Use ls -lL, and note the two numbers between the group name and the date. These numbers are the major/minor numbers for this slice.

    # ls -Ll /dev/dsk/c1t9d0s7
    brw-r----- 1 root sys 32, 71 Dec 5 10:05 /dev/dsk/c1t9d0s7

4. If necessary, determine the major name corresponding with the major number by looking up the major number in /etc/name_to_major.

    # grep " 32" /etc/name_to_major
    sd 32

5. Update the /kernel/drv/md.conf file with the information that instructs Solaris Volume Manager where to find a valid state database replica on the new disks. For example, in the line that begins with mddb_bootlist1, replace the sd with the major name you found in Step 4. Replace 71 in the example with the minor number you identified in Step 3.

    #pragma ident "@(#)md.conf 2.2 04/04/02 SMI"
    #
    # Copyright 2004 Sun Microsystems, Inc. All rights reserved.
    # Use is subject to license terms.
    #
    # The parameters nmd and md_nsets are obsolete. The values for these
    # parameters no longer have any meaning.
    name="md" parent="pseudo" nmd=128 md_nsets=4;

    # Begin MDD database info (do not edit)
    mddb_bootlist1="sd:71:16:id0";
    # End MDD database info (do not edit)

6. Reboot to force Solaris Volume Manager to reload your configuration. You will see messages similar to the following displayed on the console.

    volume management starting.
    Dec 5 10:11:53 host1 metadevadm: Disk movement detected
    Dec 5 10:11:53 host1 metadevadm: Updating device names in
    Solaris Volume Manager
    The system is ready.

7. Verify your configuration. Use the metadb command to verify the status of the state database replicas, and use the metastat command to view the status of each volume.

    # metadb
        flags       first blk   block count
     a m p luo      16          8192        /dev/dsk/c1t9d0s7
     a     luo      16          8192        /dev/dsk/c1t10d0s7
     a     luo      16          8192        /dev/dsk/c1t11d0s7
     a     luo      16          8192        /dev/dsk/c1t12d0s7
     a     luo      16          8192        /dev/dsk/c1t13d0s7
    # metastat
    d12: RAID
        State: Okay
        Interlace: 32 blocks
        Size: 125685 blocks
    Original device:
        Size: 128576 blocks
            Device      Start Block  Dbase  State  Reloc  Hot Spare
            c1t11d0s3   330          No     Okay   Yes
            c1t12d0s3   330          No     Okay   Yes
            c1t13d0s3   330          No     Okay   Yes

    d20: Soft Partition
        Device: d10
        State: Okay
        Size: 8192 blocks
            Extent      Start Block  Block count
            0           3592         8192

    d21: Soft Partition
        Device: d10
        State: Okay
        Size: 8192 blocks
            Extent      Start Block  Block count
            0           11785        8192

    d22: Soft Partition
        Device: d10
        State: Okay
        Size: 8192 blocks
            Extent      Start Block  Block count
            0           19978        8192

    d10: Mirror
        Submirror 0: d0
          State: Okay
        Submirror 1: d1
          State: Okay
        Pass: 1
        Read option: roundrobin (default)
        Write option: parallel (default)
        Size: 82593 blocks

    d0: Submirror of d10
        State: Okay
        Size: 118503 blocks
        Stripe 0: (interlace: 32 blocks)
            Device      Start Block  Dbase  State  Reloc  Hot Spare
            c1t9d0s0    0            No     Okay   Yes
            c1t10d0s0   3591         No     Okay   Yes

    d1: Submirror of d10
        State: Okay
        Size: 82593 blocks
        Stripe 0: (interlace: 32 blocks)
            Device      Start Block  Dbase  State  Reloc  Hot Spare
            c1t9d0s1    0            No     Okay   Yes
            c1t10d0s1   0            No     Okay   Yes

    Device Relocation Information:
    Device    Reloc  Device ID
    c1t9d0    Yes    id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3487980000U00907AZ
    c1t10d0   Yes    id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3397070000W0090A8Q
    c1t11d0   Yes    id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3449660000U00904NZ
    c1t12d0   Yes    id1,sd@SSEAGATE_ST39103LCSUN9.0GLS32655400007010H04J
    c1t13d0   Yes    id1,sd@SSEAGATE_ST39103LCSUN9.0GLS3461190000701001T0
    # metastat -p
    d12 -r c1t11d0s3 c1t12d0s3 c1t13d0s3 -k -i 32b
    d20 -p d10 -o 3592 -b 8192
    d21 -p d10 -o 11785 -b 8192
    d22 -p d10 -o 19978 -b 8192
    d10 -m d0 d1 1
    d0 1 2 c1t9d0s0 c1t10d0s0 -i 32b
    d1 1 2 c1t9d0s1 c1t10d0s1 -i 32b
    #

Recovering Storage From a Known Disk Set

The introduction of device ID support for disk sets in Solaris Volume Manager allows you to recover storage from known disk sets and to import the disk set to a different system. The metaimport command allows you to import known disk sets from one system to another system. Both systems must contain existing Solaris Volume Manager configurations that include device ID support. For more information on device ID support, see Asynchronous Shared Storage in Disk Sets. For more information on the metaimport command, see the metaimport(1M) man page.

How to Print a Report on Disk Sets Available for Import

1. Become superuser.

2. Obtain a report on disk sets available for import.

    # metaimport -r -v

    -r    Provides a report of the unconfigured disk sets available for import on the system.
    -v    Provides detailed information about the state database replica location and status on the disks of unconfigured disk sets available for import on the system.
Reporting on Disk Sets Available for Import

The following examples show how to print a report on disk sets available for import.

    # metaimport -r
    Drives in regular diskset including disk c1t2d0:
      c1t2d0
      c1t3d0
    More info:
      metaimport -r -v c1t2d0
    Import:
      metaimport -s <newsetname> c1t2d0
    Drives in replicated diskset including disk c1t4d0:
      c1t4d0
      c1t5d0
    More info:
      metaimport -r -v c1t4d0
    Import:
      metaimport -s <newsetname> c1t4d0

    # metaimport -r -v c1t2d0
    Import: metaimport -s <newsetname> c1t2d0
    Last update: Mon Dec 29 14:13:35 2003
    Device    offset  length  replica flags
    c1t2d0    16      8192    a u
    c1t3d0    16      8192    a u
    c1t8d0    16      8192    a u

How to Import a Disk Set From One System to Another System

1. Become superuser.

2. Verify that a disk set is available for import.

    # metaimport -r -v

3. Import an available disk set.

    # metaimport -s diskset-name drive-name

    -s diskset-name    Specifies the name of the disk set being created.
    drive-name         Identifies a disk (c#t#d#) containing a state database replica from the disk set being imported.

4. Verify that the disk set has been imported.

    # metaset -s diskset-name

Importing a Disk Set

The following example shows how to import a disk set.

    # metaimport -s red c1t2d0
    Drives in diskset including disk c1t2d0:
      c1t2d0
      c1t3d0
      c1t8d0
    # metaset -s red

    Set name = red, Set number = 1

    Host                Owner
      host1             Yes

    Drive    Dbase
    c1t2d0   Yes
    c1t3d0   Yes
    c1t8d0   Yes

Recovering From Disk Set Problems

The following sections detail how to recover from specific disk set related problems.

What to Do When You Cannot Take Ownership of A Disk Set

In cases in which you cannot take ownership of a disk set from any node (perhaps as a result of a system failure, disk failure, or communication link failure), and therefore cannot delete the disk set record, it is possible to purge the disk set from the Solaris Volume Manager state database replica records on the current host.
Purging the disk set records does not affect the state database information contained in the disk set, so the disk set could later be imported (with the metaimport command, described at Importing Disk Sets).

If you need to purge a disk set from a Sun Cluster configuration, use the following procedure, but use the -C option instead of the -P option you use when no Sun Cluster configuration is present.

How to Purge a Disk Set

1. Attempt to take the disk set with the metaset command.

    # metaset -s setname -t -f

This command will attempt to take (-t) the disk set named setname forcibly (-f). If the set can be taken, this command will succeed. If the set is owned by another host when this command runs, the other host will panic to avoid data corruption or loss. If this command succeeds, you can delete the disk set cleanly, without the need to purge the set.

If it is not possible to take the set, you may purge ownership records.

2. Use the metaset command with the -P option to purge the disk set from the current host.

    # metaset -s setname -P

This command will purge (-P) the disk set named setname from the host on which the command is run.

3. Use the metaset command to verify that the set has been purged.

    # metaset

Purging a Disk Set

    host1# metaset -s red -t -f
    metaset: host1: setname "red": no such set

    host2# metaset

    Set name = red, Set number = 1

    Host                Owner
      host2

    Drive    Dbase
    c1t2d0   Yes
    c1t3d0   Yes
    c1t8d0   Yes

    host2# metaset -s red -P
    host2# metaset

See also:

- Chapter 18, Disk Sets (Overview), for conceptual information about disk sets.
- Chapter 19, Disk Sets (Tasks), for information about tasks associated with disk sets.
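The verification step in the purge procedure above can be scripted by searching the metaset output for the set name. The following is a minimal sketch, assuming that metaset prints a "Set name = NAME," line for each set it knows about, as it does in the example output. The function name diskset_listed is hypothetical, and the sample text stands in for real metaset output.

```shell
# Hypothetical helper, not part of Solaris Volume Manager.
# Reads metaset output on standard input and succeeds if a set named $1
# is listed, that is, if a line starts with "Set name = NAME,".
diskset_listed() {
    grep "^Set name = $1," >/dev/null
}

# Sample text modeled on the example output in this section.
sample='Set name = red, Set number = 1

Host                Owner
  host2'

if printf '%s\n' "$sample" | diskset_listed red; then
    echo "red is still listed"
fi
if ! printf '%s\n' "$sample" | diskset_listed blue; then
    echo "blue is not listed"
fi
```

On a live system the input would come from the metaset command itself, for example metaset | diskset_listed red. After a successful purge the function should return a nonzero status for the purged set.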
Performing Mounted Filesystem Backups Using the ufsdump Command

The following procedure describes how to increase the performance of the ufsdump command when you use it to back up a mounted filesystem located on a RAID-1 volume.

How to Perform a Backup of a Mounted Filesystem Located on a RAID-1 Volume

You can use the ufsdump command to back up the files of a mounted filesystem residing on a RAID-1 volume. Set the read policy on the volume to "first" when the backup utility is ufsdump. This improves the rate at which the backup is performed.

1. Become superuser.

2. Run the metastat command to make sure the mirror is in the “Okay” state.

    # metastat d40
    d40: Mirror
        Submirror 0: d41
          State: Okay
        Submirror 1: d42
          State: Okay
        Pass: 1
        Read option: roundrobin (default)
        Write option: parallel (default)
        Size: 20484288 blocks (9.8 GB)

A mirror that is in the “Maintenance” state should be repaired first.

3. Set the read policy on the mirror to “first.”

    # metaparam -r first d40
    # metastat d40
    d40: Mirror
        Submirror 0: d41
          State: Okay
        Submirror 1: d42
          State: Okay
        Pass: 1
        Read option: first
        Write option: parallel (default)
        Size: 20484288 blocks (9.8 GB)

4. Perform a backup of the filesystem.

    # ufsdump 0f /dev/backup /opt/test

5. After the ufsdump command is done, set the read policy on the mirror to “roundrobin.”

    # metaparam -r roundrobin d40
    # metastat d40
    d40: Mirror
        Submirror 0: d41
          State: Okay
        Submirror 1: d42
          State: Okay
        Pass: 1
        Read option: roundrobin
        Write option: parallel (default)
        Size: 20484288 blocks (9.8 GB)

Performing System Recovery

Sometimes it is useful to boot from a Solaris OS install image on DVD or CD media to perform a system recovery. Resetting the root password is one example of when using the install image is useful.

If you are using a Solaris Volume Manager configuration, then you want to mount the Solaris Volume Manager volumes instead of the underlying disks. This step is especially important if the root (/) file system is mirrored.
Because Solaris Volume Manager is part of the Solaris OS, mounting the Solaris Volume Manager volumes ensures that any changes are reflected on both sides of the mirror.

Use the following procedure to make the Solaris Volume Manager volumes accessible from a Solaris OS DVD or CD-ROM install image.

How to Recover a System Using a Solaris Volume Manager Configuration

Boot your system from the Solaris OS installation DVD or CD media. Perform this procedure from the root prompt of the Solaris miniroot.

1. Mount as read only the underlying disk containing the Solaris Volume Manager configuration.

    # mount -o ro /dev/dsk/c0t0d0s0 /a

2. Copy the md.conf file into the /kernel/drv directory.

    # cp /a/kernel/drv/md.conf /kernel/drv/md.conf

3. Unmount the file system from the miniroot.

    # umount /a

4. Update the Solaris Volume Manager driver to load the configuration. Ignore any warning messages printed by the update_drv command.

    # update_drv -f md

5. Configure the system volumes.

    # metainit -r

6. If you have RAID-1 volumes in the Solaris Volume Manager configuration, resynchronize them.

    # metasync mirror-name

7. Solaris Volume Manager volumes should now be accessible using the mount command.

    # mount /dev/md/dsk/volume-name /a

Recovering a System Using a Solaris Volume Manager Configuration

    # mount -o ro /dev/dsk/c0t0d0s0 /a
    # cp /a/kernel/drv/md.conf /kernel/drv/md.conf
    # umount /a
    # update_drv -f md
    Cannot unload module: md
    Will be unloaded upon reboot.
    Forcing update of md.conf.
    devfsadm: mkdir fialed for /dev 0xled: Read-only file system
    devfsadm: inst_sync failed for /etc/path_to_inst.1359: Read-only file system
    devfsadm: WARNING: failed to update /etc/path_to_inst
    # metainit -r
    # metasync d0
    # mount /dev/md/dsk/d0 /a
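When a configuration contains several RAID-1 volumes, the resynchronization step needs the name of each mirror. The sketch below shows one way to pick those names out of metastat -p output. It assumes that a mirror appears as a line whose second field is -m, as in the "d10 -m d0 d1 1" line shown earlier in this chapter. The function name list_mirrors is hypothetical, and the sample text is copied from that earlier output.

```shell
# Hypothetical helper, not part of Solaris Volume Manager.
# Reads "metastat -p" style output on standard input and prints the name
# of every volume whose second field is -m (a mirror definition).
list_mirrors() {
    awk '$2 == "-m" { print $1 }'
}

# Sample lines taken from the metastat -p output earlier in this chapter.
list_mirrors <<'EOF'
d12 -r c1t11d0s3 c1t12d0s3 c1t13d0s3 -k -i 32b
d20 -p d10 -o 3592 -b 8192
d10 -m d0 d1 1
d0 1 2 c1t9d0s0 c1t10d0s0 -i 32b
d1 1 2 c1t9d0s1 c1t10d0s1 -i 32b
EOF
```

For the sample input, the function prints d10. On a recovered system, the names could be fed to metasync one at a time, for example with metastat -p | list_mirrors | while read m; do metasync "$m"; done, after confirming that metastat -p reports the expected configuration.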