Copyright 2006, Jeff R. Allen <jra@nella.org>. Licensed under a Creative Commons License.
This is version 1 of this report. It is made up entirely of my opinions, and has only been reviewed by one other person who worked on the project. Future versions of this document will encompass feedback from my colleagues as it arrives.
Hurricane Katrina devastated the Gulf Coast on 29 August 2005. In the weeks that followed, Radio Response pulled together a team of over 50 volunteers, using approximately $30,000 in donated products and services from dozens of vendors, to provide Internet connectivity to residents and relief workers in the Hancock County, MS area.
We proved that it is possible to deploy a wireless long-haul IP network in a disaster area. We also proved that it was possible to create a distribution network, which could distribute either local satellite IP bandwidth, or terrestrial IP bandwidth delivered via wireless backhaul. Furthermore, we proved that we could provide the services of a small scale wireless ISP (site surveys, installations, checkup visits, reliable IP, phone support) in a disaster environment.
We proved that our service was useful and desired, both by the residents of Hancock County, and as a tool to empower fellow aid organizations.
Our experience deploying VoIP was inconclusive. There are tantalizing parts of our experience that support the argument that wireless IP and VoIP can provide much more nimble service than legacy utilities. However, due to failures of technology and management in our project, we deployed VoIP much later than simple HTTP. As a result, VoIP telephone calls were a very small percentage of our bandwidth, and an extraordinarily small amount of the "good" we did in the area.
I found that we were very inefficient in all aspects of the project. From the fact that we were doing this for the first time, through the organization of staff, to the lack of a budget, there were plenty of reasons why we were inefficient. All of them are solvable, and my overall impression is that a dedicated group could do this job much more effectively than we did by planning ahead for the next disaster.
In my opinion, to be effective the organization would need to:
All of these things cost money. Thus, to be effective, the organization
would need a budget. However, it needn't cost too much. The equipment
cache can be made up of donated equipment, some of which we already have.
The staff member on call would not be a paid position, just someone who has
made a commitment to act as a leader during a deployment. The money
would mostly be spent on administrative and logistical support of the
volunteers themselves. Without a detailed budget it is hard to know,
but I suspect it would be possible to operate with $10,000 per year.
I arrived after this chapter was complete, so the following
is my understanding from talking to people involved. Any errors
are due to my failure to learn the story right.
The project started when evacuees from New Orleans began arriving in
northern Louisiana. Mac Dearman saw them at a church in his town
and realized that by simply giving the church a free connection to
his existing wireless network and putting a couple phones on a table,
he could help evacuees make contact with loved ones.
Within a couple hours, Mac had done a normal customer
install at the church (like any other customer on his network), and
installed a few pre-activated VoIP phones he had in stock (Uniden
UIP-2000 phones with service from Nuvio). Mac received requests
to hook up other shelters, both inside his network's coverage area,
and in neighboring towns. He put out the word to the Wireless Internet
Service Provider (WISP) industry that with donations of equipment
and time, he could repeat his success at the first shelter.
People and equipment began arriving, and the team went to work.
Press reports
archived on the Radio Response website document this work.
Meanwhile, in California, CityTeam
Ministries reached out to Inveneo
hoping that Inveneo's expertise creating rural
telecommunications systems in Southeast Asia and in Africa
could be used to bring
phones to a project CityTeam was undertaking at the Powerhouse of
Deliverance church in Bay St. Louis. Inveneo people (under the banner
of a new organization, AidPhone)
assembled an Asterisk-based
system which made it possible to transparently use donations of
long-distance telephone service from multiple providers using
only the AidPhone telephones. AidPhone also secured a donation of 1000
analog telephone adapters, and as many analog telephones.
At Mac's farm in Rayville (Northern Louisiana, about 5 hours from
New Orleans) the work to connect shelters to Mac's existing network
was drawing to a close. The abilities of small northern Louisiana
towns to support evacuees were exhausted, and their communication
needs had been met by Mac and his volunteers. There was more work to
be done, but the team had to find a way to be helpful nearer the
heart of Katrina's damage to do it.
At this point (approximately a week after the
storm), AidPhone and Mac's team met. AidPhone had a charter
to work in Hancock County, as a result of their partnership with
CityTeam. AidPhone had donated long-distance telephone service
and IP telephones, but all the Internet service in Hancock County
had been wiped out, so the IP-powered phones were useless.
Mac's volunteers had the know-how and equipment to make a long-haul
Internet link and a community distribution network to deliver it.
Mac's team also had a desire to work in Hancock County, so the
match was perfect. Mac and Mark Summer of AidPhone came to an
agreement to work as independent groups in partnership to achieve
the goal of deploying Internet service and AidPhone's phone service
to Hancock County.
I joined the project on September 14, 2005, approximately 2 weeks
after Katrina made landfall. The story from here on is based on my memory,
my journal
entries, postings on the Radio Response website,
and email archives.
When I arrived, the team had just moved from Rayville
to a staging area in Ponchatoula. There, we had a large air conditioned
office in front of K&D Truck and Trailer Repair. The rent
was paid for several months, courtesy of a donor. The office
formed a
useful base because Ponchatoula suffered only light wind damage
from the storm, and so power and Internet were working reliably.
Phones donated by Front Range Internet of Colorado were
crucial at this point for coordinating donations and volunteers.
We used an Internet connection shared wirelessly with a small
business next door.
The first night I was there, I took on the role of back-office
sysadmin. That night we renamed the organization that was previously
best described as "the guys at Mac's farm" to "Radio Response".
We made a website and moved to it content that one of our volunteers
(Paul Smith) had previously created. We made this effort
because we understood early on that many eyes were on us, and we
needed to be able to explain concisely who we were
and what we were doing. Of course,
while brand-new back-office volunteers were doing this, other
folks were out in the field, making contacts with government
and private organizations, etc. It took some time to spread
the word that we had a new name, and website, etc. We handled
it the best we could, but it must have been pretty confusing to
our contacts to have a group of volunteers change names overnight.
About half of the team was already in Hancock County. At this
stage, they were camping at the Powerhouse of Deliverance church.
They were living self-contained off of the food and water they brought
with them into the disaster area. None of the team had access to
an RV (with air conditioning), so the nights were hot and humid.
The days were frustrating too, because at this point they were
still trying to understand the situation on the ground and make
contacts with the authorities. CityTeam's charter and backing were
helpful, but nonetheless it simply takes time to meet the right
people and gain their trust. Finding our first government liaison
made all the difference.
The difference in comfort between Ponchatoula and Hancock County led some
of us to commute, coming back to Ponchatoula in the evenings.
That was a workable system, but not having the entire team together
in the evenings made it difficult for the support folks in Ponchatoula
to know what work was needed. At this time, Hancock County was a
black hole of communication, at least from the outside in. Cell
phones were working passably when calling out of the region, but
folks there would get too busy and preoccupied to remember to
call back to Ponchatoula to give updates.
As the project in Hancock County was getting underway, a request for help
in New Orleans came in from Joel Johnson
who was working on technology at the site of the Common Ground Collective in Algiers.
Joel had managed to use donated hardware to make a computer lab. For Internet
access, he used an EVDO card in his Mac iBook, and used the Mac as a NAT box
and router. He'd created a second Internet access point using the same
technology at the health clinic. The idea was to provide Internet access to
the residents of Algiers, so that they could fill out FEMA forms online and get
help quicker. With better bandwidth than the EVDO card, Joel also hoped to
provide free phone service.
On September 17 (2.5 weeks after the storm) I visited Joel in
Algiers to assess the situation there and see
how Radio Response could help him. What I found was that DSL was
working in the neighborhood, so a long haul link from downtown New
Orleans was not required. Further, it seemed unlikely to me that
the population in the neighborhood would be well served by a set of
Internet labs as we envisioned in Hancock County. This was
due to the small remaining population, and to their
technological illiteracy (and in some cases, English illiteracy).
Making a successful project there would have required a staff of
attentive "buddies" to walk people through using the computers, and
the right timing with the need to fill out FEMA forms and the return
of the evacuees. The Common Ground Relief folks were preoccupied with
running the medical clinic, a distribution point, and doing journalism
to document the perceived security threats and human rights abuses. It
was unrealistic to expect them to have the interest to run the education
campaign needed to make a big Internet project successful.
It simply didn't all align, and I went back to Ponchatoula having
decided not to attempt a second Radio Response project in Algiers.
I did commit Radio Response to helping Joel
by giving him donated equipment when possible. Later, the NoMesh project came into
being and worked on using the existing DSL in the community to
make a community mesh network. Radio Response passed some of
our donated equipment to NoMesh to help them out.
(Joel also wrote up a document with his lessons learned.)
Most of the people involved in NoMesh had worked for Radio
Response on earlier trips. When they were ready to come back for a
second work trip, they chose to work in Louisiana instead of Hancock
County Mississippi. Having the flexibility to let people redeploy
themselves like this made it possible for people to make themselves
useful in the way that fit their temperament best. I think it was a
strength of the self-organization evident in all of the Internet-related
groups working in the area.
During this time, the activity at Ponchatoula was support work.
Jim Patient and Kevin Cupit were constantly on the phone arranging
donations, volunteers, and other logistics. Aleks Clark was working
on a back-office system to help us keep inventory and volunteers
organized. Then there was the refurbishment lab. A huge shipment of PCs arrived
in various states of disrepair, and Ben Earnhart of the University of Iowa did
a fantastic job of creating a refurbishment lab and managing several volunteers
to work through the pile, turning unknown (and pretty broken) hardware into working machines
with a copy of Windows 98 installed. We ended up using about 10 person-days
(the long 16-hour days typical of disaster work) to get about 40 machines
prepped for delivery to clients. This huge labor cost forced me to later make a
tough policy decision: I started declining donations of PCs unless they had
been refurbished and had an OS installed before being shipped. This pushed the
refurbishment work to the edge, outside the disaster area. The idea was to make
the best use of volunteer labor both inside and outside the disaster area. The
policy makes sense when described in those terms, but at least one donor went
away very unhappy when I declined his donation. Two other donors were
able to accept my conditions and sent 50 more computers ready to go. We passed
those computers on to two clients: Morrell Foundation and St. Clare's Catholic
Church and School.
For the record, we never committed the time to work the phones and get
legal Windows licenses for the machines, though we were pretty sure we could
have gotten Microsoft to agree to such a donation. Instead, we used a freely
available tool we found on the net that can generate license keys (illegally).
We would have preferred to skip the whole question of software licensing by
using Free Software, but at the time FEMA's website required IE 6, and thus we
felt we needed to deliver Windows machines, regardless of the licensing or
management issues. The average machine had less than 64 megs of RAM, so we felt
we had to standardize on Windows 98. We did make a few XP machines, and tried
to place them where administrative work would be happening, as they run
OpenOffice better, and have dramatically better USB support (for use with digital
cameras, flash drives, etc).
While installing operating systems, we solicited a donation of Deep Freeze
from Faronics. We hoped to use it
to prevent the machines from being destroyed by spyware, adware, viruses,
"helpful" users adding utilities, etc. The initial attempt at downloading and
installing it failed, and I never committed the time it took to solve the
problem. Faronics offered a direct contact with a tech support guy, but we
could not find the time to make use of him. I regret that it turned out this
way, because we did notice later that many machines got pretty broken pretty
fast. It wasn't nearly as bad as I expected, but it was bad enough that we
would have been providing better service to our customers
if we'd gotten the Deep Freeze installation done right the first time.
Around September 19, I left Ponchatoula and started working full-time in
Hancock County. Some support folks remained, as there was more work to do there
(specifically computer refurbishment), but the situation in Hancock County
seemed to be progressing, and it was time to bring in more people to augment the
backbone team and start doing site installs. That turned out to be an
optimistic version of the schedule, but there was worthwhile work to be done in
Hancock County anyway, so it made sense to have people down there, even as the
backbone folks struggled to get things working. At that point the Hancock
County team had secured us access to the EOC, which gave us both a comfortable
base to work from, and also more resources to work with. We started using our
hardware and skills to repair, expand, and otherwise tweak the emerging campus
IP network at the EOC; when there are IT people around, inevitably people ask you
to fix things! Using a port on the EOC satellite uplink, we managed to build a
small network connecting several of the Search and Rescue teams on-site to the
Internet. We also helped the community radio station, WQRZ-LP, get its studio
phone line working again (using the Tracstar satellite in the Public Affairs
Office).
IT in general, and IP networking in particular, was a pretty decentralized
thing at the EOC. Groups were expected to arrive self-contained, and there was
little attention paid to how to integrate the parts into a sum greater than the
parts. I believe this was a minor failing, but it is hard to see how it might
have been different; groups come and go quickly, taking their equipment with
them. Most groups are too small to bring a dedicated IT person, and those that
do (like the military) might not be in a position to easily share resources due
to reasonable policies that end up tying people's hands. Because we were
completely independent of all the agencies, and had carte blanche from our
donors to "do good" with the hardware on hand, we were able to act as an
unofficial IT organization on the edges of the EOC. The EOC itself did not
effectively take advantage of our skills, preferring instead to use an outside
contractor (NVision
Solutions). We kept clear of them in order to not start any kind of
political problems.
While all this was going on in my world, the folks building the long-haul
link to Gulfport were struggling to meet their self-imposed deadline of "just a
day or two". As I was not present for most of that work, I can't tell the story
of how those links got built. My understanding from listening at meetings is
that there was a delay while MCI attempted to deliver the donated link directly
to Waveland via the newest and least understood wireless technology, WiMax.
Eventually, Radio Response folks got a chance to do the shot with our own
equipment and people familiar with that equipment. We added another tower into
the route to make sure the RF energy did not get lost in the ground; with
a long wireless shot, the curvature of the earth plays an important role. One
of the keys to success was using equipment that the people on-site were
familiar with. The hardware and software that make up these systems are not
incredibly complex, but because of the tight profit margins in the industry,
the equipment is not of terribly high quality (software or hardware).
It is important to know the quirks of the
equipment in order to plan and install networks that work. Simply doing it
according to the manual does not work.
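That comment about the curvature of the earth deserves a number. Below is a
minimal link-planning sketch, not a record of the calculations done for the
Gulfport shot; the 30 km path length and 5.8 GHz frequency are illustrative
assumptions only.

    import math

    EARTH_RADIUS_KM = 6371.0

    def earth_bulge_m(d1_km, d2_km, k=4/3):
        """Height of the earth's bulge (meters) at a point d1/d2 km from each
        end of the path, using the usual k-factor for atmospheric refraction."""
        return (d1_km * d2_km * 1000.0) / (2 * k * EARTH_RADIUS_KM)

    def fresnel_radius_m(d1_km, d2_km, freq_ghz):
        """Radius (meters) of the first Fresnel zone at a point d1/d2 km from
        each end of the path."""
        wavelength_m = 0.3 / freq_ghz            # c / f, with c ~ 3e8 m/s
        d1, d2 = d1_km * 1000.0, d2_km * 1000.0
        return math.sqrt(wavelength_m * d1 * d2 / (d1 + d2))

    if __name__ == "__main__":
        path_km, freq_ghz = 30.0, 5.8            # illustrative values only
        mid = path_km / 2
        bulge = earth_bulge_m(mid, mid)
        fresnel = fresnel_radius_m(mid, mid, freq_ghz)
        print(f"{path_km} km path at {freq_ghz} GHz:")
        print(f"  earth bulge at midpoint:    {bulge:5.1f} m")
        print(f"  first Fresnel zone radius:  {fresnel:5.1f} m")
        print(f"  clearance needed above a flat profile: ~{bulge + 0.6 * fresnel:.1f} m")

On a path of that length the bulge alone is on the order of 13 meters at the
midpoint, which is exactly the kind of number that pushes you to add an
intermediate tower rather than trying to shoot end to end.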
After several days of work the two-hop link from MCI in Gulfport was up and
running (the day was September 21, 3 weeks after the storm, a week and a half
after the Radio Response team arrived in Mississippi).
Meanwhile, the team had also been preparing a distribution network
centered on the Waveland water tower. However, due to dependencies built into
the design of the network, we were unable to make progress on customer
installations until the long-haul link to Gulfport was done. The problem was
that you need functioning customer sites to shake out problems in the
distribution network, and our customer sites were originally designed to depend
on the DHCP server in Gulfport, on the far side of the not-yet-completed
long-haul link. Making far-flung parts of the network depend on each
other in this way was obviously a mistake, one of many lessons I learned on the
project.
By the time the long haul link to Gulfport neared completion,
there was incredible pressure to quickly show results.
The pressure came partly from
inside the group (motivated people simply wanting to see success)
but also from our partners, who felt they'd trusted us to keep our
word and come through for them, but we were failing to do that.
As the new kid on the block, technologically, we had a responsibility
to manage expectations, then beat them. With many different people
from different backgrounds talking to government and private organizations,
it was inevitable that some of them made promises we couldn't keep.
That endangered our credibility and added pressure to show results.
In an environment where you need the cooperation of many other
organizations to get the job done, you cannot risk losing credibility,
or you risk losing cooperation.
The result of that pressure as the long-haul link was completed was
a very harried day of attempted customer installs.
Despite fielding four teams to at least four sites that day,
to the best of my knowledge none of the installations were
completed. One reason for this frustrating and fruitless day was
certainly technical problems. A much bigger contributor was the lack
of pre-planning, equipment preparation, documentation, and team
training. Ideally, the delay getting the long-haul link ready would have
given the part of the team not involved in the long-haul link time
to do this preparation work. I did not step forward and lead this
work because as a newcomer to this technology, I could not understand
what needed to be done. In retrospect, it is easy to see what was
missing. The prime thing missing was someone to take leadership who
had a background in creating a distribution network and doing customer
installs on that network. The people who could have served as
those leaders were focused on the long-haul work, and were unavailable.
In the next few days, we got the customer sites working that
we'd started on the first day. And with that, the network was up
and limping. As with all operations projects, the nature of the work
changed subtly from "building" to "maintaining" the network. We put
time and energy into network monitoring tools (though I regret that
it took me several weeks more to start using MRTG to make
these graphs).
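For readers who have not used MRTG, the whole trick is to poll the standard
SNMP interface counters at a fixed interval and graph the deltas. The sketch
below shows that polling arithmetic; it assumes net-snmp's snmpget command is
installed, the device answers SNMP with the community string "public", and
interface index 1 is the one of interest. The hostname is made up.

    import subprocess, time

    # ifInOctets / ifOutOctets from the standard IF-MIB, interface index 1 assumed.
    OIDS = {"in": "1.3.6.1.2.1.2.2.1.10.1", "out": "1.3.6.1.2.1.2.2.1.16.1"}

    def get_counter(host, community, oid):
        """Fetch one SNMP counter using the net-snmp command-line tools."""
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid], text=True)
        return int(out.strip())

    def poll(host, community="public", interval=300):
        """Print average bits/sec in each direction every `interval` seconds,
        which is the same arithmetic MRTG does before drawing its graphs."""
        last = {k: get_counter(host, community, oid) for k, oid in OIDS.items()}
        while True:
            time.sleep(interval)
            now = {k: get_counter(host, community, oid) for k, oid in OIDS.items()}
            for k in OIDS:
                delta = (now[k] - last[k]) % 2**32   # allow for 32-bit counter wrap
                print(f"{host} {k}: {delta * 8 / interval:10.0f} bits/sec")
            last = now

    if __name__ == "__main__":
        poll("gulfport-router.example")              # hypothetical hostname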
Furthermore, we lost at least one install crew (and sometimes more)
to maintenance tasks. The network was very flaky at that time,
as we struggled with power problems at backbone and customer sites,
mysterious IP address conflicts and ARP timeouts, failure to acquire
addresses over DHCP, and a flaky long-haul link. Some folks
wanted to move ahead on debugging VoIP problems at that point, but
the network was simply not stable enough to justify using
our time like that.
Hurricane Rita threatened the Gulf Coast about this time, and the team
scattered. Some members chose to weather the storm at the EOC. Others, who
were ready for a weekend anyway, traveled to Ponchatoula, LA and to Pensacola,
FL. The team that stayed behind at the EOC got an incredible amount of
work done, adding evidence to the theory that a small group of motivated people
can accomplish much more than a large group of equally-motivated people!
When we returned after Hurricane Rita made landfall on the Texas/Louisiana
state line (September 25, 3.5 weeks after the storm),
the network was a little bit bigger, but that addition gave us
all kinds of new options. With help from Rescue International,
the team from Southern California Wireless, Bob van Zant, and Don
Castella made a link from Waveland north to the EOC. In addition to
making the network reachable from our lab and sleeping quarters,
this also gave the network access to alternative uplink capacity.
As our long-haul link continued to flake out, we now had the opportunity
to use satellite uplinks at the EOC to serve as a backup link to the net.
The experience of managing a network with one reliable but low-performance
link (satellite) and one high-performance (but unreliable) link was quite
frustrating. It was made more difficult by a decision we made to share our
limited satellite bandwidth only with customers at the EOC, and not with those
in the city of Waveland. Later, we got exclusive access to satellite links from
Cisco and from the EOC, which we were allowed to share outside the EOC. Since
that time, I've considered what kind of technical solutions I would want
available to make it easier to manage a network with multiple paths out to the
Internet. I will propose one such system later in this report, in the section
titled "Bandwidth sharing box".
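To give a flavor of what I have in mind, here is a minimal sketch of
ping-based uplink failover on a Linux router, using nothing but the iproute2
and ping commands. The interface names, gateway addresses, and probe hosts are
placeholders, not our configuration, and the fuller "bandwidth sharing box"
idea appears later in this report.

    import subprocess, time

    # Ordered by preference: terrestrial long-haul first, satellite as backup.
    # Interface names, gateways, and probe hosts below are placeholders.
    UPLINKS = [
        {"name": "long-haul", "dev": "eth1", "gw": "10.10.0.1",     "probe": "4.2.2.2"},
        {"name": "satellite", "dev": "eth2", "gw": "192.168.100.1", "probe": "4.2.2.1"},
    ]

    def alive(uplink):
        """True if a single ping out this interface reaches the probe host."""
        return subprocess.call(
            ["ping", "-c", "1", "-W", "2", "-I", uplink["dev"], uplink["probe"]],
            stdout=subprocess.DEVNULL) == 0

    def use(uplink):
        """Point the default route at the chosen uplink (Linux iproute2)."""
        subprocess.call(["ip", "route", "replace", "default",
                         "via", uplink["gw"], "dev", uplink["dev"]])

    def watchdog(interval=30):
        current = None
        while True:
            best = next((u for u in UPLINKS if alive(u)), None)
            if best is not None and best is not current:
                print(f"switching default route to {best['name']}")
                use(best)
                current = best
            time.sleep(interval)

    if __name__ == "__main__":
        watchdog()

Even something this crude, running on the router where the two uplinks came
together, would have cut down considerably on manual intervention.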
During that time, those of us who made contact with customer sites were put
in a very difficult position. With a network that was incredibly unreliable, it
was hard to know, when visiting a site, if it would be working at all. And when
it was, it was hard to promise anything about performance the next day. The
customers were remarkably tolerant of this bad network behavior, but our
credibility was certainly hurt by it, and it hurt my morale, and the morale of
other volunteers to be faced time and again with saying the same thing: "yes
it's down, no we don't know why, yes we're trying to fix it, no we don't know
when it will be fixed". Failover and auto-rerouting capabilities seem like
luxuries out of reach in a disaster-response network, but they are all the more
important because every component in a disaster-response network is more
stressed than in a normal network. Note: "component" here includes
the operations staff!
Around the time of Hurricane Rita, we experienced a major change in
staffing. Leadership and key personnel who had been in place since the arrival
of the group in Hancock County were ready to go home.
This turnover started
pushing me towards a leadership role, as I had been present as a worker during
the building of the network, and would be present for many more weeks while we
operated and grew the network. I chose to let the transformation of
the team happen on its own, as those folks present were not in need of
strong leadership, just an informal coordinating meeting from time
to time. Practically speaking, however, starting around September 24, I was
the on-site project manager, and I take full responsibility for where
the project went (or didn't go) after that point.
During my tenure as the project manager there were two trends: a dwindling
crew, and a solidifying network. After Rita, the crew was dominated by IT
generalists, with the exception of Don Castella, a very experienced WISP owner
from Chicago, and Bob van Zant, a wireless ISP installer.
Others included Brent Chapman, Raymond MacKay,
Corlus Nance, Matt Justice and Sean Head. Sean worked on installations with Bob
and Don. Brent and I trained the interns (Corlus and Matt). Finally, Ray, Brent
and I worked on stabilizing the network, via documentation, long debugging
sessions to understand the current behavior, and by ultimately implementing a
network redesign incorporating what we'd learned from trying unsuccessfully to
operate the network as originally designed by the first wave of volunteers.
The redesign also took into consideration new donated
hardware that became available later.
Don provided valuable training to the team
on how to run a solid network. His van-of-plenty continued to turn up
parts we needed to do professional installations long after any reasonable
person would have expected it to be exhausted. Don's generosity with his
materials, and his time, made a huge difference at a time when the project
was really struggling to deliver a stable network.
Bob van Zant was the remaining backbone guy on the project at that point.
Bob's most important task was to fix the flaky link to Gulfport. He tried
several things, but the thing that finally made the link stable was switching
to 900 MHz Trango gear, which limited the link to 3 megabits. Bob also worked
hard on extending the network to Port Bienville, but due to confusion over
how to aim the equipment, he never got that link up. After Bob had to go home,
I climbed the Waveland water tower and the Port Bienville water tower in
order to place equipment selected and assembled by Don Castella and me.
Brent led the charge to gather data to justify the redesign, then
made it happen. His first week on the project was unwittingly spent
on the data gathering stage, as he struggled to make himself useful by
doing customer installs and repairing failing things.
One of the other significant events after Hurricane Rita was leaving the EOC
at the vocational school's wood shop and moving in with International
Aid. The move was a significant disruption to the team, but I tried to manage
it with as little turmoil as possible. Having an office to work
from at International Aid was very valuable to the team. Having our warehouse
space reduced from half of the wood shop down to our 40' container was
difficult, but by carefully packing we made the best of it, using the
trailer itself as our warehouse.
With the network redesign out of the way and the network behaving in a
predictable and stable way, we were able to start expanding it again. Don
expanded the network north to the FEMA camp at the Equestrian Center outside
the Kiln. This promised to be a great location, because our team, like many
volunteer groups, was housed there. Unfortunately, FEMA elected to close the
camp and move the volunteers to NASA Stennis, so our bet on the FEMA camp
did not pay off in the long term. We also turned Second Street Elementary into a
repeater site, to give us better coverage in Old Town Bay St. Louis. We used
Second Street Elementary to extend the network to the Calvary Kitchen, and to
CityTeam's community center at the McDonald ballpark ("Field of Dreams").
We hoped to also provide Internet access to the school when it opened, but
I didn't follow through on that after I left the project.
During this time we had a failure in the middle of the network, separating
it into two pieces. The northern chunk included the FEMA camp, International
Aid, and the EOC. The southern chunk was the rest of the network all the way to
Gulfport. Because we still had exclusive use of a hot-spare satellite at the
EOC, we were able to arrange for the northern part of the network to use it to
get out to the Internet. Several days later, Don got some firemen to climb the
Waveland water tower for us to repair the problem. Later, on one of my climbs, I
inspected the cable that had failed and found damage on it between the ladder
and a VHF repeater put in place by the fire department. I met the fire
department the day they took the repeater down, and they told me they had been
changing batteries every 10 days. That means they'd had several chances to
accidentally cut the cable to our equipment over the course of the six weeks
it was there. The outage was probably almost inevitable, as we were using
indoor-rated cable that was not adequately protected, and the fire guys
are not trained to work around networking equipment without breaking
things. I relate this story not to cast blame on the firefighters,
but to point out some lessons to be learned: towers are shared, not everyone
is as careful as they should be, and cables will be cut, so you have to
design for it.
Another challenge that presented itself during this time was of a political
sort. Students from the Naval Postgraduate School in Monterey, CA had
installed a
wireless network like ours in the early days. Their network was primarily in
support of government and public safety, but it also reached some private
feeding centers and distribution points. When their deployment was finished,
they left the equipment and went home. Apparently the Postgraduate School
wanted its equipment back. Some government agencies (we don't know which
ones) in Jackson, MS let a contract to replace the Naval Postgraduate School's
network with another wireless network. We were told that the contract
gave the contractor exclusive access to the City of Waveland water tower.
A sub-contractor of the prime contractor set its sights on us as amateur
intruders in the way and, either through ignorance or malice, convinced
Waveland's Chief of Police that our equipment was hurting his telephone
service (as provided by the Naval Postgraduate School's network). They
carted off some of our equipment, and we never got it back.
There is a remote
possibility that our equipment was, in fact, conflicting with theirs.
Due to some poorly executed plans to interconnect the networks, our network
might have been connected to theirs without the knowledge of the entire
team. This is why it is critical to have a clear plan for how to handle
multiple uplinks; connecting to someone else's public-safety network,
then breaking their
network, is a very good way to get in big trouble with the authorities.
The Waveland water tower was the center of our network, and having
someone else claim exclusive access to it was a huge problem for us.
We resisted making any changes to the network for a while, relying on
inertia to keep our equipment on the tower; after all, who's going to
climb a tower just to take down someone else's equipment? We were
emboldened somewhat by the fact that we were at the time
providing service to influential aid groups like International Aid
and the Calvary Kitchen. The Mayor of Bay St. Louis ate most of his meals
at Calvary Kitchen, so we hoped we had some pull with him.
The standoff went on for a while, until we had a meeting with the contractor
to decide how to proceed. Our major concession was that we would commit to
staying out of the way of any location
where the contractor was being paid to provide Internet. Mac
Dearman made a deal with the contractor that Mac felt was in our
interest, but shortly after that they stopped returning our calls,
making it impossible to proceed with the deal. Mac and I surmise
that the contract was getting renegotiated, or it fell through or something.
I eventually broke the deal (of my own accord) and installed a connection
in McDonald ballpark, one of the places we were supposed to stay away from.
I did this on my last day on the project because I was unwilling to
leave the community center at McDonald ballpark off the net due to
political problems. When I told Mac what I'd done he agreed it was
a good idea.
The remainder of my time in Hancock County was spent on routine visits to
the customer sites to check on them, make small repairs, extend customer
networks, and so on. During this time I was preparing to hand the network off
to the interns. Corlus Nance was not able to put in enough hours on the project
to get fully trained due to schedule conflicts with another job of his. So it
was up to Matt Justice to learn all he could to be able to maintain and grow
the network in our absence. Don left a few days before I did. Brad Jackson made
a return visit and worked several days during the final hand off to Matt.
I left October 28. Matt's first day running the network was October 29.
In the meantime, Brad repaired the Second Street Elementary site, which
had been disassembled by the renovation workers.
At the time of the hand off, Mac Dearman visited and took custody
of the equipment in the trailer. I had planned to deliver him an inventory,
but Mac, his wife Sharon, and Brad packed the trailer before I had a chance
to do the inventory. I'd finished a renumbering of the network, making
it much easier to understand. I documented
the state of the network in a series of tables, and got help from Brent
to update the network diagrams.
Matt Justice documented his work exceptionally well using the
Radio Response website. His first entry was the 10/29 update. Updates after that
tell the story of Matt's maintenance work and work on new
connections. They also mention all the help he got from Bruce Barton
of Rescue International, who had helped us all along. Bruce
became especially helpful after I left as another mentor for Matt.
The future of the network is uncertain. Matt's commitment to the project
expired in mid-December, when his semester ended. He's enjoyed the
work, and will probably come down to Hancock County a few more times,
but he will not have a regular schedule in January.
Some customer sites are disappearing. The FEMA camp has been closed since
late October (though our network still reaches it). International Aid's
last day was reported by Matt to be December 8. The New Waveland Cafe left the
day after Thanksgiving. However, other sites will be there for the foreseeable
future: the Davis store, Second Street Elementary, the CityTeam community
center at McDonald ball field, and the Morrell Foundation's iCare Village.
One idea people discussed was forming a locally-operated non-profit
organization to take over the network. The social support networks of the
county are in tatters right now. While it might have been possible before
the storm to find an interested board of directors, funding, and so on,
it is currently not possible, in my opinion. However, we did not make
careful inquiries to see what folks in the community thought about this
option. The high school next to the vocational school apparently has
a computer teacher who might be a good resource. The community college
down in Bay St. Louis, which was destroyed, might be another place to look
for interested people to maintain the network.
Another idea is to select a turn-off date, and schedule a work-weekend
of local team members (Brad, Mac, Sharon, Matt) to collect all the gear.
Once the gear is collected, it could be used as the beginning of
an equipment cache. A cache such as this is described later in this
report as part of my vision for a more successful deployment.
There would be a moderate cost associated with the cache, for a storage
unit, or for the one-time purchase of a shed to be placed on Mac's property.
As of now (January 2006) I do not know what will become of the network.
Fundamentally, determining its future falls on the shoulders of
Mac Dearman, the founder of Radio Response.
In this section, I discuss the customers we reached, the
applications they used, and the network that made it possible.
The first two are measures of our achievement against our goal
to "be helpful". The third is a measure of our technical
achievement given the situation we were faced with.
Because of the fluid nature
of the network, there may be customers who I forgot
(or who I never even knew we had). This list is current as of
December, 2005.
One thing we learned is that making a long haul network
takes time, lots of time. The incredibly quick successes Mac's team had in
northern Louisiana were due in part to Mac's network already existing. The hard
work was done long before the need arose to bring the churches online. In
Hancock County, Radio Response was called on to build both a long-haul and a
distribution network from scratch. That took time, and that time delay affected
the way our network ended up being used. For instance, one much-touted
application of Internet technology to Katrina, finding loved ones using
services like KatrinaList.net, was
already a solved problem by the time our Internet access was available; people
already had found their loved ones some other way. We rarely witnessed people
filling out FEMA forms online using our computers. They had already done their
FEMA paperwork at the FEMA field station just 100 feet away from our Internet
lab. The single biggest use of the net was for ordinary "I'm hanging in there"
type email. One lady sat down and said, "Thank god! Now I can pay my bills!".
Another young woman sat down and started looking for a new job, as the day care
center she'd worked at before the storm was now closed. Finally, out at the
Davis Store, I heard that they found an email address for the state
unemployment insurance office that allowed them to get a question answered even
though the phone lines were jammed.
It would not be a stretch to say that an equal amount of usage came from
disaster relief workers themselves. First, organizations used our Internet
connection to communicate with their home base, making them more effective than
they would have been with cell phones only. Second, individuals used the
Internet connection to explain what they were experiencing to friends back
home. They sent out email to worried parents and posted to blogs. Sharing their
experiences like this helped attract more volunteers and resources to get the
job done. In fact, the Radio Response blog contributed to getting extra waves
of volunteers that we might not have gotten otherwise. I've also seen spikes
on the traffic graphs that lead me to believe large files, probably videos, are
being uploaded from the network. For a sample of the kind of video being
published from the Katrina damage area, search Google Video for
Waveland. I don't know for sure, but it's possible some of those were
published over our network.
Out on the Internet, a huge amount of attention went to the problem of
moving health and welfare messages. Frustrated engineers and other Internet
users trapped by circumstance in their hometowns channeled their desire to help
into systems like Katrina List.
Google and other companies aggregated the data into public search engines like
Google's hurricane-specific people-search page.
However, our network was up and running too late to be of any use for
posting information like this. By the time our system was in the hands of
residents, they had already found some other way to report their status. It is
likely that users of our network were searching public databases for the status
of other people, but they were not actively filing health and welfare reports
over our network.
Another storm-related Internet use that received a significant amount of
attention was applying for government assistance over the Internet. By
observation, it seems our network was not used much for this. In all my site
visits, I never saw anyone doing a FEMA application online. I don't know if
SBA loan applications were possible online, but it's a moot point because the
best place in the county to do SBA paperwork was at the Small Business Recovery
Center created by SBA and the Chamber of Commerce at the Coast Electric
conference center. There, they had public fax service and satellite Internet. I
never got a chance to visit, but my understanding from radio interviews with
SBA folks was that there were enough counselors on hand there to personally
help each applicant. Likewise, FEMA set up a processing center in the K-Mart
parking lot. With that kind of support, it's no surprise that people were not
using the Internet to apply. They should have anyway; I heard a report in
Algiers that well-educated residents with easy access to the Internet out of
the area got FEMA financial assistance within days of filing an Internet claim,
while their poorer, less educated neighbors waited in Algiers for FEMA
representatives to come door to door. It stands to reason that getting your
application submitted as soon as possible via the Internet would be the best
strategy.
I suspect there's also a self-selection effect at work. My impression
of what websites people were using was only from observation of our
computers. Laptop users could be expected to be more comfortable with
doing financial tasks online. So they might have been using the FEMA
website and I just didn't ever see anyone doing so.
As Rita came ashore, time and again I'd find people using our
computers to track its progress. Access to television was severely
restricted by the living situation of most residents. Radio was
widely available to those who had cars, and to those who remembered
to pick up a hand-held radio at one of the distribution points.
Because our computers were at feeding centers, it was easy to drop
by after a meal to check on the status of the approaching storm.
The public computers were used the same way public computers anywhere
(libraries, Internet cafes, hotels) are used. The most common use was web-based
e-mail (Hotmail, Yahoo, Gmail, and so on). Though we did not supply the
computers with IM clients installed (a simple oversight, not a policy
decision), most computers sprouted IM clients immediately since we
allowed users to install their own software.
One problem sometimes encountered with public Internet terminals is people
viewing objectionable material. We had no reports of this problem, though
some of our customers were worried about it when we brought the computers in.
One reason is that we were always careful to place the computers such that
the screens were visible to the public. With no privacy to indulge in bad
behavior, people don't. Arranging the computers like this is a trick I learned
from working in an Internet cafe in Guatemala.
Here are some other things people told me they were using the
computer for:
A loudly-touted application of the network was to be restoring telephone
service rapidly, using Voice over IP (VoIP) technology. In reality, VoIP
arrived on our network later than e-mail and HTTP access. The reason is
that the e-mail and HTTP protocols were developed at a time when connectivity
in the Internet was much slower and lower quality than it is today. The
protocols have built into them (either explicitly, or implicitly) an assumption
that the underlying network will be slow and unreliable, and as a result they
degrade gracefully in such a network. VoIP, in contrast, only came about in the
last 5 years or so, and has mostly been developed in an environment of cheap,
high speed, high quality (low packet loss and low jitter) networks. There are
legitimate reasons for why VoIP is engineered as it is, but the bottom line is
that we were not able to deploy VoIP until we were able to deliver a very high
quality network. This meant that VoIP came much, much later than other
applications, so late that alternative modes of making voice contact were
already in wide use. My observation is that in our deployment in Hancock
County, VoIP was an order of magnitude less useful to the citizens than
simple HTTP access.
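To make the contrast concrete, here is a back-of-the-envelope sketch of what a
single call demands. The figures are the standard ones for the G.711 codec,
not measurements from our network; the 3 Mbit/s number is the capacity our
stabilized long-haul link ended up with.

    # One G.711 call: 64 kbit/s of audio, sent as one RTP packet every 20 ms.
    PAYLOAD_BYTES = 160                # 8000 samples/s * 1 byte * 0.020 s
    OVERHEAD_BYTES = 12 + 8 + 20 + 18  # RTP + UDP + IP + Ethernet headers
    PACKETS_PER_SEC = 50               # one packet per 20 ms, each direction

    bits_per_sec = (PAYLOAD_BYTES + OVERHEAD_BYTES) * 8 * PACKETS_PER_SEC
    print(f"one G.711 call: ~{bits_per_sec / 1000:.0f} kbit/s each way")
    print(f"calls that fit in a 3 Mbit/s link: ~{3_000_000 // bits_per_sec}")

    # Unlike a web page, every one of those 50 packets per second has to arrive
    # on time; a few percent loss or a burst of jitter ruins the call, whereas
    # HTTP simply retransmits and the user only notices a slower page load.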
Cell phone and public telephone service was widely available and
reliable by the time we were able to provide VoIP-based telephone
service. This meant that our telephone service was considered by
most customers as a nice touch, but of secondary importance to the
public Internet services.
The one exception was at the Davis store. This location was several miles
outside of town, far from where the BellSouth public telephone banks were
located. Because it was a relatively poor neighborhood, cell phone penetration
was very low. People were
unable to get new cars to replace cars destroyed by the flood. As a result, the
VoIP telephone at the Davis Store was the only telephone within walking
distance for about 300 people in the neighborhood.
Nonetheless, a user at the Davis store told me that he was getting
busy signals from the Mississippi State Department of Employment.
He sent an email and got a reply within a day. This was further proof that the
store-and-forward technology of the traditional Internet beats the
new VoIP technology, unless the communication task must involve voice
and must happen in real time; virtually the only task that requires
VoIP seems to be letting grandparents hear grandchildren's voices!
As we operated the network in a maximally open manner,
I'm certain we had customers and offered services that we never even
knew about. It would have been possible for unseen laptop users to
join the network; it would have even been possible for someone to
set up a wireless bridge from one of our customer sites to their
own network.
As an aside, there were many more laptops present than I expected
for a rural community recovering from a category 5 hurricane. On
reflection I believe it is because a laptop is easy to move.
I surmise that many people packed the laptop when they evacuated,
and brought it back with them when they returned. A number of people
commented to me, "I used to have a desktop computer like this, but it got
destroyed in the storm."
The ability to fax was a common request. We were unable to offer analog
fax machines with our VoIP configuration, but it seems likely that enterprising
users with laptops were able to use digital cameras or scanners and Internet
fax software to make their own fax service. Huge outbound bandwidth spikes
from time to time imply that people were publishing video from our network out
to public hosting services, for instance Google Video. I never witnessed
applications like this on our network, but the beauty of the Internet is that
Radio Response did not need to plan for all the possible applications. By
providing simple IP (even IP NAT'ed behind two routers) people could use the
network for what they needed, when they needed it.
The network went through two distinct phases, with two very different
designs. Each was an achievement, as was the transformation from one
to the other.
The initial network design was driven largely by the requirements of the
initial VoIP hardware we had available for the project. The design also had a
certain KISS (Keep It Simple, Stupid) aspect to it. In retrospect, some
simplifying features of the network were untenable. Another limitation on
network designs was the paucity of customer-edge equipment available to us.
Designing a network for the needs of one particular application is widely
agreed to be bad practice, but as the team members arrived on scene with
a vision of the project that focused on telephone service (to the exclusion of
more traditional protocols like HTTP), it was somewhat inevitable that this
would happen. Add to the focus on VoIP the difficult requirements
presented by the VoIP hardware (DHCP virtually required, and only one layer of
NAT allowed), and the design was basically set in stone before any
alternatives could be considered.
The design called for a flat network using Ethernet bridging from the Cisco
in Gulfport all the way out to the farthest customer device (PC or VoIP
telephone adapter). The Cisco was to be configured to be the only DHCP server
on the network, and also to provide NAT services for the network. The network
was 10.10.0.0/19. The DHCP range was originally slated to be the entire range,
less the router IP address, 10.10.0.1. Radios were to be given addresses from
192.168.0.0/24, with no provision for packets to be routed between 10.10.0.0/19
and 192.168.0.0/24 as a "security measure". This set us up for a situation
where routine diagnosis was impossible from behind NAT boxes,
and network management
tools would need to be dual-homed to monitor the entire network.
An IP allocation plan in the 192.168.0.0/24 net block fairly quickly proved
not to scale as the network grew, so that eventually the addresses in
192.168.0.0/24 were hopelessly scrambled, requiring that workers refer
to an up-to-date network diagram in order to have any hope of understanding
the network.
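A structured allocation, where the address itself tells you which site a radio
belongs to, would have avoided most of that scrambling. Here is a minimal
sketch of the kind of plan I mean, using Python's ipaddress module; the site
list and the choice of a /27 per site are illustrative, not what we actually
did.

    import ipaddress

    # Carve the management block into one small subnet per tower or customer
    # site, so a radio's address tells you where it lives. Sites are examples.
    MGMT_BLOCK = ipaddress.ip_network("192.168.0.0/24")
    SITES = ["gulfport", "waveland-tower", "eoc", "second-street", "davis-store"]

    def allocation_plan(block, sites, new_prefix=27):
        subnets = block.subnets(new_prefix=new_prefix)
        return {site: next(subnets) for site in sites}

    if __name__ == "__main__":
        for site, subnet in allocation_plan(MGMT_BLOCK, SITES).items():
            hosts = list(subnet.hosts())
            print(f"{site:15s} {subnet}   radios {hosts[0]} - {hosts[-1]}")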
The first problem we found with this design was that there were no static
addresses available for customer equipment that needed to be statically
assigned, like print servers. We requested a range of static addresses
from the router administrator to solve this problem.
The next problem we quickly saw was that DHCP was flaky or outright
broken at customer sites. The problem seemed to be that broadcast
traffic was being blocked in various parts of the network. Because none
of our customer premises equipment supported DHCP relay, we were counting
on broadcast working right from end to end in the network. It didn't, but
we never figured out why, exactly. The Trango firmware supports features
related to clamping DHCP, as do the Nortel switches that were donated
to us, and which we used at every customer site. We tried to disable
all broadcast blocking, but it's clear we were not successful.
Brent and I saw, but did not successfully diagnose, situations where two
computers on one physical LAN behaved differently. For instance, one could ping
the router in Gulfport, and the other could not. When dumping the ARP table on
the router, the MAC address of the broken machine would appear to have been
proxy ARP'd by some other part of the network. Fundamentally, our network
design depended upon wide-area ARP working correctly, and in our network
broadcast packets were not reliably being passed, so ARP was not reliable.
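One diagnostic we could have leaned on more is generating the broadcasts
ourselves and watching whether they are answered. The sketch below uses the
scapy packet library to send a DHCP DISCOVER out a given interface and reports
whether any OFFER comes back; it has to run as root, scapy must be installed,
and the interface name is a placeholder.

    import random
    from scapy.all import Ether, IP, UDP, BOOTP, DHCP, srp1, get_if_hwaddr, conf

    def dhcp_offer_seen(iface, timeout=5):
        """Broadcast a DHCP DISCOVER on `iface` and return the OFFER, if any.

        If this works from a laptop at a customer site while the site's own
        equipment still fails to get a lease, suspect broadcast filtering in
        the radios or switches between the site and the DHCP server."""
        conf.checkIPaddr = False        # the OFFER comes back from the server's IP
        mac = get_if_hwaddr(iface)
        discover = (
            Ether(src=mac, dst="ff:ff:ff:ff:ff:ff")
            / IP(src="0.0.0.0", dst="255.255.255.255")
            / UDP(sport=68, dport=67)
            / BOOTP(chaddr=bytes.fromhex(mac.replace(":", "")),
                    xid=random.randint(0, 0xFFFFFFFF))
            / DHCP(options=[("message-type", "discover"), "end"])
        )
        return srp1(discover, iface=iface, timeout=timeout, verbose=False)

    if __name__ == "__main__":
        offer = dhcp_offer_seen("eth0")  # interface name is a placeholder
        print("got an OFFER" if offer else "no OFFER: broadcasts may be blocked upstream")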
IP address conflicts on both the 10.10.0.0/19 and 192.168.0.0/24
networks happened a few times because we did not have reliable record-keeping
mechanisms.
As a result of all these problems, mixed with flakiness on the
long-haul network link to Gulfport, our network was exceptionally unreliable.
Worse than simply being broken, it was behaving erratically; sometimes
things worked right, encouraging customers to keep trying, then discouraging
them when things failed again. It was a very frustrating network to work
on, in part because I had watched it get built and, though I had no better
idea to offer the team at the time, I had half expected the design to have
these types of problems.
Several things worked together to make a second design possible.
First, it was necessary. The network was so unmanageable, we simply had to
do something. We were unable to grow the network while chasing after bugs, and
our customers were losing patience with us, going so far as asking us how they
could order satellite connections to replace our failing connection.
Second, the natural turnover of staff brought people into the project
(specifically Brent Chapman) with new energy, experience in situations like
ours, and with no history on the project. The turnover also sent people with a
vested interest in the first design home. Brent had nothing to lose by
proposing and implementing a new design. Furthermore, I used my tenure on the
project to give me the authority to make decisions on the behalf of the
project. I encouraged Brent to fix the network, and promised him I'd run
whatever interference I needed to so that he'd not have any political pressure
on him for doing so. Some would argue that with such a small project, for such
a good cause, there should be no pride or politics to overcome. To those
people, I'd respectfully request they remove and discard their rosy spectacles.
People are people, and people under pressure behave even less reasonably
than you'd normally expect.
Another thing making a redesign possible was our realization that VoIP was
not the killer application of our network; HTTP was. This we could see by the
willingness of the customers to brave our flaky network in order to get their
email by simply hammering on the reload button when things didn't work right.
All this time, we were unable to devote time to debugging the broken VoIP
phones we first deployed. Nevertheless, customers weren't complaining about the
broken phones; they wanted reliable HTTP, not VoIP.
The last thing that figured into the redesign was that we received 10 out
of an eventual 50 Linksys ATAs (consumer-grade broadband
Ethernet-to-Ethernet routers with built-in analog telephone adapters). These
gave us a cache of equipment that could be configured identically. Together
with the Trango SUs we were already using, we were able to create a
standardized demarcation between the core network and the customer networks. The
Linksys provided a local DHCP server. To get that benefit, we had to add a
second layer of NAT, but that made it easier to understand customer sites,
because they could always be configured with the same subnet, making training
of technicians easier. The second layer of NAT did not affect the VoIP
implementation, as it had access to the external address on the routers (though
we proved later that the Linksys ATA can also work behind multiple layers of
NAT fairly reliably using NAT keep-alive messages).
The final design is documented by Brent on the
team wiki. The key to the
new network design was to get rid of DHCP on the backbone, and carefully guard
access to the backbone. Together with a new numbering scheme I
implemented after Brent left, the network took on a stable form that others
have been able to maintain after the team that deployed it left,
an attribute that the first incarnation didn't have.
I believe this design could be used for a pre-staged network in order to
reduce the amount of configuration (and therefore, time and expertise)
needed in the field. I will propose such a network later in this report.
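As a small illustration of what "pre-staged" could mean in practice, the
sketch below prints a configuration sheet for each customer install kit,
following the spirit of the final design: a static backbone address for each
customer router, an identical private LAN behind every one, and no DHCP on the
backbone. All of the addresses and kit names here are invented for the
example.

    import ipaddress

    # Invented example blocks, in the spirit of the final design: static
    # backbone addresses for customer gear, an identical LAN behind each router.
    BACKBONE_CUSTOMER_BLOCK = ipaddress.ip_network("10.10.8.0/24")
    STANDARD_LAN = ipaddress.ip_network("192.168.1.0/24")

    def staged_kits(kit_names):
        """Yield one pre-printable configuration sheet per customer install kit."""
        backbone_hosts = BACKBONE_CUSTOMER_BLOCK.hosts()
        gateway = next(backbone_hosts)             # backbone-side router
        for kit in kit_names:
            yield {
                "kit": kit,
                "router WAN address": f"{next(backbone_hosts)}/{BACKBONE_CUSTOMER_BLOCK.prefixlen}",
                "WAN default gateway": str(gateway),
                "LAN subnet (same at every site)": str(STANDARD_LAN),
                "LAN DHCP": "enabled on the customer router only",
            }

    if __name__ == "__main__":
        for sheet in staged_kits(["kit-01", "kit-02", "kit-03"]):
            print("-" * 50)
            for field, value in sheet.items():
                print(f"{field:32s} {value}")

Kits configured and labeled this way ahead of time could be handed to an
install crew that never needs to touch a configuration screen in the field.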
In the following sections, I put into writing lessons I learned while
working on the project. They are in an order that makes sense to me, but
practically speaking they all basically stand alone.
In Hancock County, access to the Internet was needed and appreciated.
Telephone service was available for affected citizens, but it was not
convenient to those without cars. As a result, our VoIP service had value,
but was far from the only way people could communicate via voice.
Cell phone service was reliable within a few weeks, and cell phone vendors
set up tents to sell new service to those who were living without landline
service for the first time in their lives.
Aid agencies, both private and government, take cues from corporations on
how to conduct themselves. Both FEMA and the Red Cross depended to a huge extent on
telephone service working. Their behavior in this regard was strange, as it
seemed to disregard the reality that close to 100% of the victims from Hancock
County were without reliable personal telephone service. Signs popped up on
shared telephones urging the lucky few who got through to the severely
overloaded FEMA or Red Cross call centers to keep the agent on the line and
hand the phone off to other citizens. Since aid agencies evidently prefer to
receive requests via telephone, groups like ours that seek to provide telephone
service will always be welcome. It might make more sense, however, to cut the
dependency on groups like ours, and simply offer the services in the field,
without hiding behind a call center.
Our approach to deploying IP needs refinement. Our approach was to build a
long-haul network from Gulfport to Waveland, then build a distribution network
inside of Hancock County. Because the same experts who were building the
long-haul network were needed to make progress on the distribution network, the
two ended up being deployed in sequence. It would have been better to
concentrate on distributing locally available satellite bandwidth first, then
finish the terrestrial long-haul network and switch over to the higher quality
terrestrial network later. The easy availability of satellite uplinks in
the disaster area surprised us, and made "deliver IP into the region, then
distribute it" the wrong order to work in.
A free wireless ISP (and telco) in the middle of a disaster is
useful for private organizations that are telco-challenged. Rich, well
prepared organizations bring a van with a DirectWay satellite unit on
top. Organizations with connections at the EOC can get the incumbent carrier
to expedite landline phone service restoration for them. But the majority of
small teams, even from rich organizations, benefited from having
experts take care of networking, so that they could concentrate on what
they are good at. For example, we provided telephones and Internet to
a team from Pfizer, which was distributing drugs to local clinics. We
also provided Internet service to a Navy Seabee base (probably for
morale-related use, not operational use).
It is unclear how many of these observations are applicable, as they come
out of Hurricane Katrina, whose scale was so huge. It could be that
lessons taken from Katrina won't be useful for the next 20 years' worth
of hurricanes. I
was too busy with Katrina-related work to watch carefully after Rita and Wilma
to see what the needs were. There were calls from people outside the disaster
areas for us to go to those hurricanes too, but one thing I learned by working
inside a disaster area is to ignore the people outside, and only believe the
reality on the ground. Without having visited east Texas or Florida, I don't
know what the needs really were.
The technology we used (Trango Broadband long-haul and distribution
equipment, outdoor 802.11b equipment, and consumer-grade home networking
equipment) was appropriate for the job, but it did present some problems.
The quality of the engineering of the software (and to a lesser extent,
hardware) is very low in these types of devices. Software bugs are very common,
and unless you are using a particular "blessed" version of the firmware,
behavior is far from predictable. Because most people on the project were not
familiar with the devices (thus knowing the features to avoid, and the
blessed version numbers), it was very hard to tell the difference between
mistakes we were making, software bugs, and hardware failures. This was not a
theoretical problem; we saw all three types of problems, sometimes two at a
time (making them exponentially more difficult to debug).
Software quality problems and all, our technical approach was still the
right one to take. Because these are simple, cheap devices meant to be
integrated by relatively inexperienced network engineers (or in some
cases, completely untrained home users), they are easy to use
in an environment with lots of people of differing backgrounds. And because
they are loosely coupled, they still work right when other things aren't
working. A closed system that depends on a proprietary configuration server
would be dead in the water when the configuration server lost power, for
instance (a common occurrence in a disaster area).
Though it is difficult to remember in the heat of battle, "A
good plan today is better than a great plan tomorrow". This mantra
lets you make progress today, but it builds into the network problems
that are going to stack up and bite you later. So you have to learn
that when you are operating day by day on what could charitably be termed
a "good plan", you must schedule time later for rework, to incorporate
the unknowns the "good plan" glossed over. This is true in all network
design, I think, but it is a bigger deal when the cycle time is so short;
a network built last week might be ready for significant rework this week.
This is a common problem in the emergency management context. Normal
management skills and techniques are not useful during the period when it
is impossible to plan more than a day ahead. Leaders who are successful
in this environment are grown, not trained. Thus it is important to
have continuity in an organization. Holding at least one drill before
hurricane season, organized and led by the person who is committed to
leading an actual response, would be the ideal way to grow such a leader
inside our group.
Much has been said about the "fog of war", and the "chaos of disaster
areas". It's true, all of it. And yet, it is manageable. Experienced
agencies know how to make a dent in the problem, but with weak technology
backgrounds, they might not even be getting as far as a combined
emergency management mindset plus technology could get. People attracted
to the Radio Response project were familiar with tools to manage
information, but didn't know how to work in a disaster context. There is
definitely
room for improvement.
Below I identify some of the lessons we learned about getting and
using information. I tie it together at the end with a proposal for
how I'd do it differently next time.
It is unrealistic to wait around for someone to tell you what to do.
The authorities don't know any better than you what needs to be done.
If you expect to get direction, or even accurate intelligence, from the authorities,
you'll be disappointed. It's not that no one knows the answers to the
questions, just that the people you have access to don't know them and
don't know how to find the data before it becomes stale. There
was a daily coordination meeting in the morning, but we were not invited
to listen in. Our government liaison, Bill McCusker, shared what he could
from these meetings, but it wasn't very helpful.
The authorities bring Internet access with them to support themselves.
Their priorities were understandably on other aspects of the relief effort, so
they are not too interested in a project like ours to bring Internet to
citizens. That doesn't mean our project was unappreciated, it's just that
the limited capacity of the county emergency managers did not give them the
luxury of giving us detailed briefings, etc. To those who argue this is
temporary, and that eventually emergency managers will see Internet access
as a necessity, I disagree. The priorities are transport (without which you
can't
move resources to solve any of the other problems), then communications, then
survival commodities like water and (later) food. Communications is a very high
priority, but the needs are met with a small set of linked VHF repeaters and
standalone satellite connections, not with an Internet distribution network.
So if the authorities don't tell you where to go and what to do, how do you
find out? A huge amount of it comes from chance encounters, and these are
facilitated by driving around and talking to people. It seems incredible, but
it works. Like-minded people sort each other out. Radio Response might have
been using this "network" in the early days in Hancock County, but I don't
know; I wasn't there, and information dissemination inside of the organization
was too spotty for me to know what information others were gathering.
Another reason you need to gather information outside of official channels
is that the authorities don't know everything that is happening, and can't.
American society is broken into many classes and divisions, and while
soft-focus human interest stories try to tell you that disasters bring us
together, the opposite is in some ways true. Traditionally disenfranchised
communities get forgotten by the authorities (not out of malice, but because
the authorities are overwhelmed and the disenfranchised community doesn't have
the contacts to be heard). Fear of racism might keep blacks from seeking help
from a city government they have long perceived as white dominated. Some
groups help themselves, and do not ask for outside help. Some even end up
rejecting outside help due to conflicts with authorities. People with legal
judgments or arrest warrants outstanding will refuse to come to official aid
stations. As a non-partisan aid group, we had a responsibility to reach out to
all kinds of groups, not just the ones the Hancock County EOC knew about.
One way we found a new customer was just like in a commercial WISP: word of
mouth. A volunteer with International Aid liked our service and recommended us
to the Morrell Foundation. Likewise, a church pastor visited International Aid
to place an order for his distribution point and we set up an appointment with
him for a site survey on the spot. He came looking for bottled water and left
with a promise of Internet access.
Though it was too little and too late, the EOC eventually started holding
networking events to let volunteer agencies tell each other what they were
doing. That probably would have been very helpful to us, had we not been
shifting into a maintenance-only mode by the time the meetings started
happening.
Before we start talking about how to manage the flow of data better,
let's list the kind of data we were dealing with:
The amount of data is substantial, and it comes in forms other than just
text. However, even having only an up-to-date printout of the text info
would have been useful at times. I took to carrying a printed phone list
and network diagram tucked into my notebook. Other team members came up
with other techniques that worked for them, more or less. But on average,
I think it is safe to say that people didn't have the data they needed
when they needed it.
Once you get that intel, how do you get it into the hands of
people who need it? The kind of people working for Radio Response are
used to using e-mail, wikis, and databases to manage information,
so it was inevitable that various people would propose to do so,
especially people "at home", separated from Hancock County by thousands
of miles, but wanting to be helpful. This was pretty much a failure,
and the reason why was simple: connectivity.
It should have been obvious, but when you go to build the Internet
someplace, attempts to keep yourself organized using the Internet are not going
to work very well. If you get access long enough to enter data, then you'll
likely not have access later when you need the data.
Sharing the information over the Internet is clearly extremely
valuable, but the problem is it can't be the only way information
is shared. So whatever systems the team uses need to work locally
until Internet access is stable, and then need to work remotely as well.
There are two choices for where to put the server the team is using:
on-site in the disaster area, or out on the Internet. The latter
appears attractive for a number of reasons, but since the majority
of the updates come from the people on the ground, and they will
need the information even when they can't talk to the public net, it
is best to put the server on-site with the workers, then use
some kind of script to do one- or two-way replication with the
public copy of the wiki.
My proposal for how to handle this is to use a local wiki,
with custom software to sync the local wiki to a remote one when possible.
Such wiki-syncing software might already exist. The local wiki would need
to live on a laptop dedicated to the job, not on someone's personal
laptop that will leave with them.
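As a rough illustration of what that syncing software could look like
(assuming, purely for the sake of the sketch, that the wiki stores its
pages as flat files, and with invented hostnames and paths), something as
simple as this would cover the one-way case whenever the uplink happens
to be working:

    # Sketch of one-way wiki replication: push the local pages to the public
    # copy whenever the uplink is up. The flat-file page store, hostname, and
    # paths are assumptions for illustration, not a description of what we ran.
    import socket
    import subprocess
    import time

    WIKI_DIR = "/srv/wiki/pages/"            # hypothetical local page store
    REMOTE = "sync@wiki.example.org:pages/"  # hypothetical public mirror

    def uplink_is_up(host="wiki.example.org", port=22, timeout=5):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    while True:
        if uplink_is_up():
            # rsync only moves changed pages, which matters on a satellite
            # link that is shared with customers.
            subprocess.run(["rsync", "-az", "--delete", WIKI_DIR, REMOTE])
        time.sleep(600)  # try again every ten minutes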
This is much more than a technical problem, of course. It is fundamentally
a management problem. It takes leadership to convince the team that
investing the time in gathering and exchanging data at the end of the day
will make the team more effective the next day. I would assign someone
the job of interviewer and reporter. They would gather data from people
in the evening into the locally hosted wiki. They would then print a
packet of the fresh data (specifically a phone list, a customer list,
GPS waypoints, and a map) for each team member to be handed out at the
morning meeting.
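The tooling for the reporter role can be trivial. Here is a sketch (file
names and fields are invented for illustration) of turning the evening's
notes into the printed morning packet:

    # Sketch of assembling the morning information packet from whatever the
    # reporter collected the night before. File names and fields are invented;
    # the point is that plain CSV files and a printer are enough.
    import csv

    def section(title, path):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        lines = [title, "-" * len(title)]
        lines += ["  ".join(row) for row in rows]
        return "\n".join(lines)

    packet = "\n\n".join([
        section("Phone list", "phones.csv"),
        section("Customer sites", "customers.csv"),
        section("GPS waypoints", "waypoints.csv"),
    ])

    with open("morning-packet.txt", "w") as out:
        out.write(packet)  # print enough copies for the morning meeting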
This would have to be a priority from the highest levels of management
(i.e. the most respected team member). When I was filling the role of the
reporter, it was considered a luxury at the end of the day to chat about what
happened, not a necessary debrief. Because it was an informal bull session
"around the campfire", people did not have their notes with them, so
gathering contact info and GPS coordinates was impossible.
One thing that I learned by watching the backbone guys at work is that
planning and installing a microwave link is hard. It calls upon a varied set of
skills from esoteric things like RF engineering, through political maneuvering
and salesmanship (to get access to towers), to hard, sweaty, dangerous work
(hanging radios at 200' above ground level in 100 degree temperatures). It is
a job that requires a critical mass of highly trained people (ideally four
people: one team of two on each end of a link). It is a job that does not go
faster with more people, and sometimes is limited by things outside your
control (weather, political climate, RF interference). It is not something that
can be scheduled, nor rushed to completion. Installations take longer than
you'd expect, and require an array of special tools and supplies (U-bolts,
antenna pigtails, waterproofing compound, cable ties). Installations have to be
done carefully and to the highest standards of workmanship, because climbing
the tower to fix something is arduous and makes
for long outages. If any of the required items (radio, supplies, trained
people) are missing or turn out to be unusable, any hope at keeping to
a schedule goes out the window.
In contrast, satellite connections are fast and easy to set up.
Some systems aim themselves. Higher bandwidth systems need to be
professionally installed, but it can be done with one or two people
in an afternoon. Most satellite connections are set up to allow
visiting users with a laptop to connect into it.
As a result, satellite bandwidth is fairly easy to come by
during the time it takes to engineer a long-haul link. A network that
can take advantage of those differing uplink technologies would
be up and running faster than one that is designed with the assumption
that it will be using a long-haul terrestrial
link for its only connection to the Internet.
In this incident, the Hancock County government was the prime controlling
agency for the recovery effort. Such local-level control might always be
the case; I haven't seen enough disasters to know for sure. In the United
States,
our preference for local government is encoded deep in both our culture and our
laws. We have a distrust of the "feds", and laws on the books that limit
federal power. There are certain things that state and federal government
cannot do, even in a disaster, until the local government invites them in. On
the plus side, this means that decisions about the future of a community are
made by people from the community. It also means that working with the
emergency management people is going to be more a matter of personal
relationships than official policy. If a county employee trusts you to
git'r'done, then you'll be free to do your work, without someone from
Washington DC asking you why you are doing it. On the negative side, county
emergency managers are likely to be less trained, and less versed in technology
like community wireless networks. It wouldn't do any good to make our case to
the FCC and expect the FCC to be present in every disaster; disaster response
is controlled by local people. So, immediately building personal relationships
inside the EOC is critical to success. Finding a government liaison who was
excited about our mission made all the difference.
The response to Hurricane Katrina in Hancock County was facilitated
by use of the Incident Command System (ICS). It is a pre-planned
organization system that is designed to scale from a single house fire
up to a Katrina-sized event. It is commonly used throughout the United
States. Its origins are in wildland fire fighting in California in the
late 70's. ICS training is widely available on the web, and formal
education is available from FEMA, and through individual states.
The ham radio community sponsors training in ICS via the ARES/RACES
system of volunteer disaster communication teams.
It would have been useful for several Radio Response team members to
have been trained in ICS, so that the operation of the EOC would have
made more sense to us. Our first government liaison, Bill McCusker,
did a good job of making sense of the situation for us. After Bill
went home to Florida, it was up to us to integrate with ICS ourselves.
One aspect of ICS that is critical to understand is the
Emergency Support Functions (ESF). As communications providers,
we are part of ESF-2, Communications. However, as volunteers, we also
need to be in touch with ESF-15, volunteer coordination. And finally,
to get access to EOC facilities (for instance, to get warehouse or lab
space, or access to a water tower) we needed to talk to ESF-5, Planning.
People in the EOC are of two minds when it comes to volunteer groups like
ours. There's a tribal, us vs. them mentality that happens everywhere in human
society. They wonder, "what are these amateurs doing getting in the way of the
professionals?" This is a problem that ham radio operators have suffered for
decades, and it's unclear it will ever get better. Luckily, in Hancock County,
distrust of amateurs was at a minimum, and cooperation ruled the day.
Regardless, we still felt a fair amount of pressure to prove ourselves quickly.
This led to some poor-quality work during the first few days, which we paid for
later. Perhaps this is how it always has to be in a disaster context, I don't
know.
One sticking point is RF emissions. It seems that most people who
handle radios for emergency operations do not understand electronics,
physics, or RF propagation. I don't know what their backgrounds are, but
in my experience, they consider any non-government use of RF equipment
a threat to their turf. It's important to remember, too, that a job
like frequency coordination attracts controlling personalities. After all,
if you get your kicks by telling people what to do, what job could possibly
be more rewarding than telling a bunch of visiting police departments
that they are not allowed to use their toys? Of course, personalities like
this are rarely swayed by facts, or by regulations. It doesn't do any
good to say, "Our devices operate in the unlicensed 900 megahertz band."
All they hear is 900 MHz, and they say, "900 is already in use by the
radio station, use some other band, or I'll have you arrested." Our response
to this declaration was to go talk to the radio folks (who were
already customers of ours, and thus loved us) and confirm that they
were seeing no interference. Sara Allen of WQRZ-LP was more concerned
that their studio to transmitter link was causing interference to us!
You can and should ask for favors from your customers. They
are willing to "pay" for their Internet service by bartering.
We ate many of our meals at kitchens we'd provided with Internet
service. We got office space, a place to park our trailer, and even
delivery services from International Aid. Towards the end of my
time there, I was even living with them, sleeping on the bed in my
car and using their bathrooms and showers.
There is a long tradition in the United States of integrating
the services of volunteer groups into the operations of the professional
emergency response teams. For one thing, it wasn't too long ago that
literally all emergency response in the United States was done by
volunteer fire departments. In rural areas, volunteer fire departments
are still the rule. So there are a large number of volunteer search and rescue
organizations that are deployed to help find bodies after a hurricane.
These organizations benefit from having a free wireless ISP.
They were some of our most appreciative customers. The nice thing
about hooking up SAR guys is that they have skills and equipment that's
useful to us, so once they've sent email home to their wife (or used
iSight and iChat to see their kids!), they owe you one, and you can get
them to climb water towers for you, or lend you UPSes.
They are typically under-used, because they are typically over-deployed.
You'll sometimes see three SAR teams camped out when there is only work enough
for one team. As a result, they are interested in helping out wherever they
can. If you can give them a job, they'll get it done for you. This was also
true of professional fire fighters assigned to Hancock County on mutual aid
contracts.
One thing to understand about SAR folks and fire fighters is that while
they may be eager to work, they do not know how to do neat and tidy installs.
It's a fact of life, and you live with the results, but you should at
least be aware of the problem going into it. One of our biggest outages
(losing AP
1, linking Waveland water tower to Stennis water tower) was probably due to
fire fighters replacing batteries in a repeater. They smashed the cable our
team had left unprotected, cutting several of the pairs of copper. Whose fault
was that outage? No one's really. It's just reality in a disaster environment.
Our equipment was not even labeled, so they had no way to contact us, if they'd
wanted to.
The team that gave us the most help was Rescue International. Bruce
Barton of Rescue International loaned us lots of equipment, and arranged
to have one of his guys climb Stennis International Airport's water tower.
Later, once I left the project, Bruce worked with Matt Justice to help
bring new sites on line. He was so helpful he became an honorary Radio
Response guy.
We didn't have any interactions with the electricity companies, but
to me, they seem to be the ideal partner for us in the future. Here's why.
First, by definition, where they are, the power has been restored.
They have lift trucks, making it easy to do tree-based installs at customer
sites. They seemed to me to be way better organized for rapid response
than every other organization (certainly than the EOC). They had a comfortable
place for their guys to sleep and eat. Their camp was eventually taken
over as the FEMA camp at the Kiln, but only after the electric companies
had already finished their work and gone home.
Finally, their network gets repaired at approximately the rate
ours grows. For instance, getting 30 miles of transmission line back
into service in a week's time would not be unreasonable, just as we
were able to get a 30 mile Internet link done in about a week. Like our
network, each mile of power line is much simpler than the same mile of
telephone equipment.
The one trick with talking to electric companies is that they already use a
lot of RF and Internet technology, and you'd need to be sure they understood
where we are coming from as a voluntary provider of community networks. They
use RF to send SCADA (Supervisory Control and Data Acquisition) data from
remote parts of their network. Some power companies are also
getting involved in
Broadband over Power Lines (BPL), and might consider us a competitive threat.
An open question: what can we do for them to get them to let us
ride their coat tails? How do they do command and control while rebuilding
the network? Could we get them to hire us to restore Internet to their
facilities, then we use those facilities as distribution points?
Because our response to the Katrina disaster was ad-hoc, we were building
the organization while trying to get the job done. Even harder, we
were building a coalition of organizations. A certain amount of time and effort
was wasted on butting heads. I suspect that folks from our partner,
Inveneo/AidPhone, felt that they were not respected as much as they would have
liked, and that Radio Response people "took over" the project. I'm sympathetic
to such a complaint, but when you are working in partnership with volunteer
groups whose membership ebbs and flows, power shifts happen.
The situation would have been easier with a single project manager who was
committed to working on the project from start to finish. Less effort would
have been spent on transitions and personal conflicts. The results would
have been strongly dependent on the effectiveness of the manager. They
would need to be able to make good priority decisions, to attract and
motivate hard workers (while weeding out tourists), and to enforce
decisions.
The right answer is probably the organized chaos we ended up with, but it
is frustrating. Using seasoned people, who have been through real events
and/or drills and have pre-existing personal relationships, would be
helpful. That
would tend to weed out the tourists, and let people become accustomed to each
other's working styles before they get on scene.
We needed a working printer available to team members. Printers were donated
to the effort, but we did not make an effort to have one permanently
connected in the office, with easy-to-access drivers, and so on. This
was an oversight, and it would be easy to fix next time. Having a printer
easily available would help with distributing info ("Let me just print
out a copy of the current info packet before you go out on that job")
and also getting equipment labeled.
We did not have outdoor 802.11b gear that we could trust. We had a big
donation of Deliberant 1300A APs, but they seemed to be factory returns:
some of them were already configured, and several had hardware problems.
The other problem with the Deliberant 1300As is that they have no
field-accessible factory reset switch.
We also had a lot of different Linksys devices available to us.
They occasionally exhibited bad DHCP behavior, which was difficult to
diagnose, and could not be repaired remotely.
We needed to use heavy-duty UPSes at backbone sites, and light-duty
ones at customer sites. Donations of both arrived when the need became
clear, but it was something we should have had on hand from the
beginning.
We should have made name tags/badges and business cards for ourselves. They
give you credibility. Also, people tend to fall back to physical systems
(scribbling on little pieces of paper) when everything else is broken. You
can't give someone your e-mail address when they are sleeping in a tent and are
waiting for you to make their Internet connection out of spare parts.
The names for things were fluid and non-standardized. Even our
group changed names when I first got there.
It is important for people to understand the benefit of using
standardized names for locations in the network. Perhaps with
a better documentation and team update system, naming would
have been less of a problem. Disagreements over whether a site was called
Baypoint or the Davis Store would quickly fade away when the morning
information packet showed one name only.
The telephone system behaved as well as could be expected.
Several things came to light that I did not know before:
First, cellphone service, at two weeks after the storm, was
acceptable. It was sometimes difficult to make a call, to be
sure, but the signal strength was always quite high. Getting
power to the cell sites quickly was obviously a priority. They
also added a lot of cell sites, using Cells On Wheels (COWs) or
temporary installations. SMS was much more reliable than voice,
however. Getting more team members comfortable using SMS
would have been time well spent.
Second, VoIP works around telephone network congestion. It was very
difficult to call inbound to Ponchatoula two weeks after the storm. The
recovery effort was in full swing, and the demand on the network was much
higher than the small amount of rerouted capacity could deal with. The VoIP
phones in Ponchatoula never exhibited the same problems the landlines did.
Why? Because the last telco hop for them was in Colorado. From Colorado to
Louisiana, the call traveled on the Internet. Colorado wasn't having a
disaster, so its lines were ready to take as many calls as we were
receiving. This is an important feature of VoIP that's little understood
and appreciated. It's unclear how a Vonage phone with a New Orleans prefix
would have behaved; it's
likely that it would be partly impacted, as the call would probably be
delivered at least to the LATA before Vonage would get a chance to move the
call to VoIP. A good strategy for using VoIP for disaster relief might be to
have terminations in many different LATAs ready before the disaster, and then
choose the terminations to use by which LATA is least affected.
Third, expedited service restoration in areas affected only
by wind damage (not saltwater flooding) was quite fast. However,
once you have service reconnected, don't count on it lasting.
As the linemen start repairing things properly, the hacks they put
in to make the EOC's phones work come back out. Basically, a
phone line restored in an expedited manner is an outage waiting to happen
as the restoration effort proceeds. No one wants it to work that way, but
practically speaking, it happens. Service restoration in areas affected by
flooding basically required the entire network to be rebuilt, because
switch boxes corroded and overhead lines were ripped
to shreds. Mississippi, unlike Florida, does not get enough hurricanes
to force them to move all the telephone lines below ground.
You need to have lots of UPSes. You need both big ones for backbone
sites (1400 VA) and little ones (500 VA) for customer sites. The
purpose of the UPS is primarily to give the equipment very clean
power. Customers tolerate and understand small outages due to power
cuts. It is not necessary to build the network to operate with no power at
all; you just need to invest in UPSes to protect the equipment and reduce
support problems.
At customer sites, do not attempt to operate on generator power unless it
is a huge (75 kVA) generator. The huge generators at the New Waveland Cafe and
Christian Life never gave us any problems, but the small generator we shared
with power tools at Morrell ate our UPS. UPSes with sophisticated power
monitoring and ground fault detection do not work well on generators.
It would be better to have a simple, stupid UPS than one that is trying
too hard to protect you from bad power. By definition, a generator
delivers bad power!
At backbone sites, you need reliable power. There is no one available to
put gas in the generator. We had this problem twice; initially at the Waveland
water tower and later at the Cisco satellite uplink. The lesson was to use your
resources to solve the power problem at the shared-infrastructure site before
making the network depend on it. The EOC can help expedite temporary power
poles to backbone sites (as Diamond Jim did for Port Bienville). Also, by
choosing your backbone sites wisely (like next door to a police station) you
can get reliable power faster without having to call in any favors: someone
else will already be expediting the power for their own reasons and you can
simply ride on their coattails.
Another possibility for reliable backbone power
might be solar. Engineering a solar power system would be feasible (it has
been done many times before in the wireless community). A solar power
system would need to be built and tested before it was deployed. It would
be important for such a system to be flexible, and not simply "we're powering
an AP with a solar panel!" At backbone sites, there are almost always
legitimate reasons for other power consumers than just the wireless gear.
For instance, you sometimes need a switch to connect multiple segments
of the network. And for long debugging sessions, you need a place to plug
in the laptop. Sometimes an idling car and an inverter can provide
laptop power, but physical limitations do not always permit using a car
for AC power.
Frequent site surveys while you are choosing backbone sites
might show that the power situation is improving on its own. We came
back from surveying Port Bienville and asked Diamond Jim to get temporary
power to the water tower. Several weeks later, we heard we had power and
should go install the link. When we got there, we found a temporary
cell site running off a generator, next to the temporary power pole.
While we were there, the generator maintenance man came, and offered to
let us plug in to the generator. We probably didn't need to wait
for the temporary AC power pole; we probably could have plugged into
the cell site generator the day it was installed, had we known about it.
In general, people's personal tool sets were sufficient to get the job
done. Wireless installs do not require much specialized equipment. The
critical specialized tools are a cable crimper and a high-quality cable
tester. It is easy to make bad CAT5 crimps, and to waste huge amounts of
time debugging them. It is better to ensure that everyone who is making
cables has a tester and uses it religiously. This should be a requirement
to join the team.
When running power over Ethernet, it is especially important that you have
100% continuity and no crossed wires, to prevent burning up equipment
or creating mysterious faults.
Ladders are required for almost every install. This is a problem, because
people who fly in and rent a car cannot bring a ladder as carry-on
luggage! You can often borrow ladders, but not having your own eventually
becomes a problem. Having a couple of ladders in the cache would be
appropriate.
Having access to a lift truck is useful for certain installs. I don't
think it would be useful enough to justify the cost and trouble of
keeping one around. The Part 15 folks drove one from San Antonio all
the way to Gulfport. It couldn't go over 55 MPH, so it took forever to
make the trip. We found that it worked OK to get access to one when
we needed it by asking around for a favor.
At the Morrell Foundation install, we
explained that we thought we needed a lift truck to do the install right.
They used their resources (favors, cash... we don't know) to get a lift
truck on site for our use.
We did not have enough hardware for customer premises installs.
It takes a baffling assortment of poles, U-bolts, and brackets
(along with all the various nuts and bolts) to do a high-quality
installation. We were dead in the water until Don came along with
his incredibly well stocked van. Local hardware stores only reopened
in the final weeks that Don was on-site. Before that, Don's van
was the hardware store.
About half of our installations were in situations where it was appropriate to
use quick and dirty mounting techniques. For instance, when mounting a
subscriber unit on a tent pole, the mounting hardware it comes with is enough.
In other situations, we connected our device to a scavenged pipe from the
debris, and then used duct tape to attach the pole to a lamp post. The other
half of the installations called for more careful installs, for instance
putting an SU on the back of a church, or on the chimney above Second Street
Elementary.
When you are installing a subscriber unit in an area where the built-in
panel antenna is sufficient, it is much easier to mount it than when it
has an external antenna. It is much more common to use an external antenna
in the flat topography of the Gulf states, so having the antenna poles
and mounting hardware you need to use them is critical. A pole, plus the
antenna, plus the SU is a fairly heavy and bulky thing. You can't simply
prop it up somewhere and leave it, or it will blow down. This is where
having the equipment to do a professional installation becomes really
important.
As we were using donated CAT5 cable, we were limited in our choice:
the blue stuff or the gray stuff (both of which were relatively
low-quality interior-grade cable). We traced two failures back to
using interior-grade cable in a situation that called for exterior-grade
cable. One was on the Waveland water tower, where
the wire was crushed by another team working on a VHF repeater. The
other failure was due to burying the indoor-rated cable at the
Morrell Foundation.
One great mounting system we came up with at the EOC was to take a normal
antenna tripod meant to be fastened to a roof and fasten it to a T-shaped set
of boards instead. Then we could weigh down the T-shaped wood with sand bags,
making a very steady base that is easy to tip up and down to work on the
antenna. When you put the
whole assembly up on top of a flat roof, you get a significant amount of
height. Sandbags are easy to come by anywhere the National Guard has been.
They seem to leave them behind like cow droppings.
Rescue International had a crank-up tower to use with their VHF repeater.
We put some equipment on it, but found that it was not ideal. You can only
access the top of the tower when it is tipped down. That means that you can't
align the antenna when it is in the normal operating position. It is also very
slow to raise and lower, because it operates off of a single little electric
winch. Finally, for all the problems, it doesn't get you too much height. It's
clear having portable towers can be useful, but there's more research and
experimenting to be done as to what the most appropriate tower is for our
application. One feature that would be very nice for wireless work is the
ability to turn the top of the tower without climbing it or taking it down.
This would make it possible to align an antenna easily.
We did not put enough effort into public relations and community
outreach. The project's impact was reduced as a result.
We encouraged some of our customers to advertise our services,
but that didn't really happen, except at the Davis Store. I put some
effort into keeping a list of public Internet sites in the flier
that the EOC's press office put out, but not enough to make sure the
list was always complete and up to date. I regret that I didn't make
time for this type of work, but when you are a technical person pulled
into management, it is easy to focus on the technical work your team is
doing and help them, instead of identifying the non-technical work that
needs to be done that they are not doing.
Another PR activity that we utterly failed on was getting the local
press to cover the story. That might have turned up local talent we
could have used to improve our transition plan when we had to go home.
I felt like I put the right amount of effort into writing the blog. It
served its purpose by keeping donors up to date on our progress, informing
donors and volunteers of anticipated needs, and acting as a journal for the
team. Having someone assigned (or self-assigned, in my case) to this job was a
worthwhile investment.
As countless operations folks have commented over the
years, "this network would work fine if it weren't for the
users". So I discuss user support first, then network
operations issues unique to the disaster context.
First, a note on terminology. There was some confusion in the group
for a while about how to refer to the people using our network.
Terminology matters, because it sets the stage for the commitments
you'll be making. I made an effort to refer to our contacts at sites
who were hosting us as "customers". Part of the confusion came from
the fact that they were, in fact, beneficiaries. Our deal was to give them the
computers and the Internet service for free in exchange for
space, power, and security. I felt there was a slight undercurrent of
disrespect developing for the contribution our hosts were making to
the network by providing space, power, and most importantly users.
After all, when you are doing your best to provide a free service
and you are receiving complaints that it is too flaky, it is easy
to blame the beneficiary instead of treating it as a legitimate
customer service problem.
Once you perceive the problem as a customer service problem,
you've got more options available to you to address it.
The prime one, with a free service, is setting expectations.
In the disaster-response context, customers can accept outages, but they
need to know what quality of service to expect so that they can make plans for
alternatives if necessary. Obviously this is much more important when you are
acting as an ISP to another aid agency than when you are simply providing
public access. At most sites, however, we were acting as an ISP to the host
organization, which was in turn acting as a public access site. The lesson I
learned was that it was important to make commitments to the customers, but
the promises might be astonishingly vague and still be of use to the customer
for planning purposes. For instance, a commitment like, "we expect 24 hours
more outage, but then we think we can keep it going 8 hours out of 24 because
we are on a new uplink that's limited to evening hours" would be firm enough to
be useful to customers.
Having a phone number that can be easily redirected
to use as a NOC is important. We tried to do that when we
first published a NOC phone number, but through some
confusion ended up with a number we could not easily
move. The right way to handle it is to have a number dedicated
and pointed at a home base out of the area at the start of
the response. The cached equipment will be labeled with the
NOC number, and the team will be able to make new labels
on the fly with the number on it (even if the labels
are just a Sharpie on duct tape). Once the on-site crew
has finished the initial bring-up and is transitioning to
maintenance mode, the NOC number should be redirected to
a VoIP or cell phone on the ground, in the area of operations.
It is critical to give customers the shortest path to
the ops team on the ground, without introducing another
layer of human message passing, as we ended up doing.
Making the NOC phone number be a toll-free number would be
a nice touch, and it so happens that toll-free numbers are
easier to redirect to arbitrary points.
Network management will be predominantly via customer
trouble reports. Tracking them with a structured system
wasn't necessary at the scale we were working on. Instead,
I simply acted as the point of contact for them and managed
the todo-list in my notebook, then assigned jobs to my coworkers.
This is one of many cases where we found record keeping with
pen and paper was the best approach.
Where the uplink to the Internet sits is a big deal.
It is relatively easy to run the distribution network. In our experience,
running a stable long haul link to Gulfport was much more difficult
than finding local satellite uplink bandwidth available for sharing.
As a result, the single biggest lesson I learned was to plan to
swap around uplinks. As the situation changes, the distribution
network will likely stay stable, but the uplink can change.
For instance you might start by using a fraction of the bandwidth
from a public information office, then later get the terrestrial
link working. When the terrestrial link fails, and a vendor like
Cisco offers satellite for a few days, you can switch back onto
that. When DSL
starts coming back, you can find a church with DSL and make them
the backup for your terrestrial link.
Technology to change the egress point of a network exists. The simplest
way to do it is to have each egress point use precisely the same config
(for instance, "listen on 10.10.0.1, do NAT for 10.10/16"). Then
administrators manually arrange for only one egress point to be
active at once. When the egress point changes, all the existing NAT
bindings evaporate, but customers can initiate new ones by reloading
their browser page, or rebooting their VoIP phone. We used this
manual system in the Radio Response network.
During an outage in the
middle of the distribution network which created two separate
contiguous zones, north and south, we arranged for each segment
to have its own egress point (one over satellite, one over the
terrestrial link to Gulfport). Multiple egress points gave us
a way to get the network up even in its partitioned state.
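To make "precisely the same config" concrete, here is a minimal sketch of
bringing a single egress point up or down by hand on a Linux gateway. The
interface names are hypothetical and the addresses echo the example above;
this illustrates the idea, not a script we actually ran.

    # Minimal sketch of manual egress switching, assuming a Linux box with the
    # uplink on eth0 and the distribution network on eth1. Only one egress
    # point in the whole network runs egress_up() at a time; when the uplink
    # moves, the old gateway runs egress_down() first.
    import subprocess

    INSIDE = "10.10.0.0/16"
    GATEWAY = "10.10.0.1/16"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def egress_up():
        run(["ip", "addr", "add", GATEWAY, "dev", "eth1"])
        run(["iptables", "-t", "nat", "-A", "POSTROUTING",
             "-s", INSIDE, "-o", "eth0", "-j", "MASQUERADE"])

    def egress_down():
        run(["iptables", "-t", "nat", "-D", "POSTROUTING",
             "-s", INSIDE, "-o", "eth0", "-j", "MASQUERADE"])
        run(["ip", "addr", "del", GATEWAY, "dev", "eth1"])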
Technology to make switching the egress point of the network simpler and
more automatic would be welcome. The obvious technologies, dynamic routing
protocols (BGP, OSPF, etc.), are not. They would significantly raise the
barrier to entry for doing backbone maintenance. The current design only
requires the same level of knowledge used in a home-networking context.
By observation of the skills present in the Radio Response team (which I
think is representative of the team members you might expect to volunteer
in the future), the lowest common denominator was
the ability to work in a home network context. Out of over 150
person-days of work, we only had about 20 person-days of work from
staff who would be able to work on a system with dynamic routing in
the core.
What's important in the "home networking" context? First, the
devices have to act like simple appliances. Configuration should be
via web user interfaces. If we are to have a dynamic routing
system, it must be able to work in the home networking context.
As there are none commercially available that I am aware of, we'd
need to implement something in preparation for our next deployment.
The best platform for a dynamic routing
system would probably be a re-flashed WRT54G, followed by
a Linux LiveCD. The nice thing about using Linux via a LiveCD is
that team members can bring the ISO file with them, or fetch it
via an EOC satellite connection, then burn the CD locally. Finally
they could load it into a donated PC and have a router ready to use.
Fetching, burning and running a LiveCD is practical in a disaster
context. Debugging BGP is not.
Auto-configuration systems would be welcome. We wasted a significant
amount of time with configuration errors. It is easy to make them
in the context we were working in, and it was exceptionally
difficult to find them and fix them. Many of our volunteers
were unfamiliar
with the normal behavior of the devices, so we had problems
telling the difference between "normal" bad behavior (i.e. non-critical
bugs), bad hardware (which a lot of people donated to us, accidentally
or otherwise), and our own configuration errors. The VoIP
devices were a particularly good example of this. The Uniden phones
from Nuvio were auto-configuring. Once they got an address assigned
via DHCP, they would fetch a file via TFTP and auto-configure themselves.
The Linksys ATAs we were using are probably capable of the same thing, but
we were configuring them by hand, and we made critical configuration
mistakes and spent time debugging them quite often. (However,
because the Nuvio phones prohibited setting the IP address statically,
they did not work once the Linksys routers exhibited a bug that
stopped them from doing DHCP.)
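For the record, this is roughly what ATA auto-provisioning amounts to: the
DHCP server points the device at a TFTP server, and the device fetches a
per-device config file from it. The sketch below generates those files;
the paths, file naming, and option names are invented, since each vendor
defines its own provisioning format.

    # Sketch of generating per-device provisioning files for ATAs to fetch
    # over TFTP after DHCP. Paths, file names, and option names are invented
    # for illustration; real devices each have their own format.
    import os

    TFTP_ROOT = "/srv/tftp"        # hypothetical TFTP root the DHCP server names
    SIP_PROXY = "sip.example.org"  # hypothetical upstream VoIP provider

    ATAS = {
        "00:11:22:33:44:01": {"user": "eoc-phone-1", "password": "changeme1"},
        "00:11:22:33:44:02": {"user": "waveland-1",  "password": "changeme2"},
    }

    os.makedirs(TFTP_ROOT, exist_ok=True)
    for mac, acct in ATAS.items():
        fname = "ata-" + mac.replace(":", "") + ".cfg"
        with open(os.path.join(TFTP_ROOT, fname), "w") as f:
            f.write(f"proxy={SIP_PROXY}\n")
            f.write(f"user={acct['user']}\n")
            f.write(f"password={acct['password']}\n")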
Immersed as I was in the details on the ground, I don't know too much about
what kind of "back office" support we were getting from Sharon, Mac's wife. In
a perfect world, we would have a formal offer and donation tracking system;
Part 15 apparently had something like this, but it lacked transparency, so
it seemed like a lot of offers fell through the cracks. For instance, they
never replied to my offer to work for them, so I ended up working for Radio
Response.
Another back office job would be to issue receipts and thank you notes.
Scheduling volunteers takes a lot of effort. It is not a wholly
back office thing, of course, since the needs are known only by
those on the ground. As the on-site manager, it was very difficult
for me to dedicate time to recruiting or even coordinating the arrival
of people who volunteered. This is something we should have done
better, but we didn't have a dedicated volunteer to assign to it.
We found out later that we should have been tracking volunteer hours.
Selfishly, tracking them would have been good for our own press,
so that we could show the amount of effort expended to assemble and
operate the network. But more important, Hancock County can use
our volunteer hours to help offset their part of the bill
FEMA will be sending them for the federal aid FEMA offered. As our labor
was highly skilled, each hour our volunteers logged could have offset a
larger amount of money than an hour from the churches bringing teenagers
down to muck out houses.
Beggars can't be choosers, and we are grateful for every single
donation we received. We did our best to make good use of the
equipment donated to us, keeping in mind our responsibility to
the donor to respect their trust in us.
We received substantial contributions of both computers
and network hardware. Both ended up presenting certain problems that
we had to solve.
The used computers that were donated to us were not in working order.
They were, in fact, often far from working order. The
same was true for the monitors.
It is unclear if the machines were broken during shipping, or
if they were donated as "machines in need of some refurbishment".
It's also easy to imagine that an unmarked stack of known-bad machines was
accidentally donated to us. It is common at large
commercial and academic sites to have computers around that
were not worth fixing because they failed near the end of their
planned service life. Such a pile would make a tempting donation
to someone who either didn't know they were broken, or thought
that donating broken hardware was better than donating nothing.
As a result of the hardware quality problems, we ended up setting up a
refurbishment operation in Ponchatoula, LA. We also found we needed to set up a
final testing lab in Bay St. Louis to weed out failures due to rough handling
between Ponchatoula and Bay St. Louis. The amount of volunteer effort the
refurbishment effort ate up was simply unbelievable. The operation in
Ponchatoula consumed at least 10 person days. The testing lab in Bay St. Louis
consumed an additional 3 to 5 person days. Working on PC refurbishment in the
middle of a disaster area is simply not a good use of resources. It is finicky
work that takes experience to do right. It is best done on large batches of
similar or identical machines, not the mish-mash that we had. It takes
a huge amount of space and benefits from special tools (hard drive
copiers, motherboard diagnostic systems, etc).
There are commercial and not-for-profit organizations dedicated to
recycling PCs, both by refurbishing them and by recycling dead ones. Often,
they get paid by large organizations to take on the liability of a large
inventory of old machines, then they refurbish them, resell some,
and donate the rest to projects like ours. It would have been preferable
to work with a partner like that to handle the refurbishment task.
As our labor pool dwindled, I pushed the refurbishment work to the
"edge" by declining donations of hardware that was not ready to use.
It was a very difficult decision to make, but the results were satisfying:
it kept the team in Hancock County focused on operating the network,
and two significant donations of PCs still arrived. We passed
one donation on to the iCare Village, and the other on to St. Clare's
Catholic School, both locations where we had taken an active role in
delivering Internet service.
We received a mish-mash of used and new home networking hardware from
private donors. Because networking appliances are less complicated than a PC,
it was relatively easy to make use of these. However we did have a significant
problem with misplacing power bricks, as staff dug through the inventory
looking for pieces to solve whatever problem was at hand. This was a
frustration, but it's unclear that it's a solvable one; enthusiastic
volunteers probably are more valuable when they are allowed to dig through the
inventory than when they are held back by careful
inventory management.
One of Brent's many contributions to the project was several boxes of
1-gallon heavy-duty zip-lock bags. These allow you to save space by getting rid
of all the paper and cardboard packaging, and you can see what's in them
without opening them. Finally, you can handle the whole "unit" (router,
Ethernet cable, and power brick, for example) with one grab.
We also received a significant amount of inventory that was
seemingly new in original boxes. As we worked with it, however, it
became clear that in two cases, manufacturers elected to send us
refurbished stock. One of the manufacturers sent discontinued access
points which were very difficult to find manuals for. I need to emphasize
that we were grateful for the donations, but the fact we were not dealing
with current hardware made us less efficient.
We found that the infant mortality rate of the refurbished hardware was
noticeably higher than with the new hardware our team members were accustomed to
using while professionally building networks. That meant we had to be very
careful to test CPE's in the lab before setting off for a customer install, and
to always carry a spare in case the pre-configured device failed during the
install. We also had to visit sites to reset or replace devices that failed in
the days after they were installed. When you are mounting a device with a lift
truck that is only available one day, it must work right the first time;
there's no second chance to replace it. For this very reason, we ended up
"wasting" an antenna up in a tree, because it was connected to a dead access
point and we had no way to get the AP and antenna back down to repair them.
The radios arrived with a mix of firmware on them. This is common, even
with new hardware, but it formed one more hoop we had to jump through. Since
upgrading some of the firmware on these devices is a tedious and error-prone
activity, it often got skipped by team members who were in a hurry, or didn't
know how to do it. Running buggy firmware on some of the radios had an unknown
effect on the network, but it likely wasn't good.
Finally, one type of hardware that was donated to us lacked a
factory reset feature. First, several of the devices arrived pre-configured,
presumably because in the refurbishment lab the "re-flash NVRAM" step had
been missed. Those were useless to us. Second, in an environment
with high turnover of people with different levels of experience, a
factory reset feature is required. It is all too easy for someone
to set it to an incorrect IP address that the next guy can't guess,
or a password no one else knows, or for the label to wash off in the
rain, leaving us locked out of the device. We lost several devices
to mistakes like this.
In this section, I lay out a program that would address a number
of the problems I saw that made us inefficient.
We need to invest in a certain amount of preparedness. We need
to prepare our network design, the equipment, and our team.
The network design we ended up with in Hancock County would likely work in
other contexts, in particular the structure of the distribution network, and
the numbering system we used towards the end. One particularly nice feature of
it is that you could pre-configure many of the components and label them, then
assemble them into a working network with minimal configuration work.
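To give a feel for how far pre-configuration can go, here is a sketch
(ranges, counts, and the NOC number are placeholders) of generating the
addressing plan and label text for a batch of customer kits ahead of time,
so that field work is reduced to plugging things in:

    # Sketch of pre-staging the stock design: carve per-site WAN addresses out
    # of the backbone range ahead of time and emit the label text for each
    # pre-configured customer kit. All numbers here are placeholders.
    import ipaddress

    BACKBONE = ipaddress.ip_network("10.10.0.0/16")  # hypothetical
    NOC_NUMBER = "1-800-555-0100"                    # must be redirectable

    subnets = BACKBONE.subnets(new_prefix=24)
    next(subnets)  # reserve the first /24 for the backbone core itself

    # One /24's worth of numbering per site makes the third octet the site
    # number, which is easy to remember in the field.
    for kit, subnet in zip(range(1, 11), subnets):
        wan = next(subnet.hosts())
        print(f"KIT {kit:02d}: Linksys WAN {wan}/16, LAN 192.168.1.1/24, "
              f"DHCP on the LAN side only. Problems? Call the NOC: {NOC_NUMBER}")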
Gathering equipment for the cache will be a two-step job. First,
we need to decide on the future of the network in Hancock County.
If some of it will be recovered, it (and the leftovers from the original
install) can form the core of the cache. Next, we need to decide on
the inventory for the final cache (how much point to point hardware,
how much distribution hardware, how many customer sites). Whatever
is missing between what we have now and what we want to have, we'll
have to get from donors. When acquiring equipment, we should do our best
to avoid getting refurbished equipment again. If a manufacturer
wants to offer a discount on new merchandise, that would be really
helpful. But the cache needs to be made up of the exact same equipment
that's available at retail, not refurbished merchandise, and not
factory returns.
The equipment in the cache should be opened, tested, and pre-configured.
It should be clearly labeled, including contact information that will
be correct during a deployment. That means whatever phone number is on
the labels must be redirectable. Labels should indicate how the equipment,
in its preconfigured state, fits into the stock network design.
The cache should also have CDs for Windows and for Linux.
For Linux, it would be really nice to have a customized installation
that creates a ready to use, "on site administration server".
It could include MRTG and Nagios, a wiki (and a wiki-syncing system
that publishes the locally maintained information out to the public
network), a Samba server for file sharing, an issue tracking system,
and an volunteer tracking system.
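As one sketch of how the wiki-syncing piece might work, assuming the local wiki stores its pages as flat files and the public mirror accepts them over rsync (the paths and hostname here are made up for illustration):

    # Sketch: push locally maintained wiki pages out to a public mirror
    # whenever the uplink happens to be reachable.  The paths, hostname, and
    # flat-file wiki layout are assumptions for illustration only.
    import socket
    import subprocess

    LOCAL_PAGES = "/var/lib/wiki/pages/"        # hypothetical wiki data directory
    PUBLIC_MIRROR = "mirror.example.org:wiki/"  # hypothetical public host

    def uplink_is_up(host="mirror.example.org", port=22, timeout=5):
        """Cheap reachability test: can we open a TCP connection to the mirror?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def sync_pages():
        # --partial lets an interrupted transfer resume next time the link is up.
        subprocess.run(["rsync", "-az", "--partial", LOCAL_PAGES, PUBLIC_MIRROR],
                       check=True)

    if __name__ == "__main__":
        # Intended to run periodically (for example from cron on the admin server).
        if uplink_is_up():
            sync_pages()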
After assembling the cache, we should test our equipment and our approach
by running at least one drill. The leader of the drill should also be prepared
to commit to responding when the group is activated and will act as the project
manager.
A drill could take place on a weekend. The team members would
travel to the equipment cache location (Mac's farm in Rayville
would be an ideal spot). The team would be given
a scenario Saturday morning and engage in a surveying, mapping,
and planning exercise. By lunch, the team should
have a plan in place that addresses the problem posed by the
scenario. The team should practice some data management techniques
at this point, perhaps using a local wiki to document the plan,
using offline mapping software, and so on. Also, to simulate the
disaster scenario, they should not use the Internet during the
planning stage until they find a satellite uplink to use.
Next, the team would choose a subset of the plan to implement on Saturday
afternoon and Sunday morning; Sunday afternoon would be dedicated to cleanup.
The subset of the plan to be implemented should include, at a minimum,
the following things:
Motivating team members to invest their time
in such a drill would be difficult, especially for those who would
incur significant expenses while traveling to the drill. It seems
likely that only residents of the Gulf States would be able
to make it, but that's probably as it should be. It makes sense
to build this disaster response capacity in the region where it will
be most useful.
The bottom line is this: our approach worked, and it worked in
a very difficult situation. With some preparedness, we could be
much more effective, requiring fewer volunteers to make the same impact,
and doing so in a more timely way.
As I've written up this report, I have identified projects that we
should undertake as we move forward. They are listed here in
no particular order.
Learn how the equipment caches for command radio systems work.
The federal government maintains a cache of ready-to-use radio systems
in Idaho. One of these systems was in use on the Waveland water tower;
it came from the National Interagency Fire Center, which maintains a
radio cache.
One difficulty with operating a cache with computers in it
is keeping them up to date and operating correctly. With network
hardware, it would probably be enough to upgrade all the devices
to a standard version, then store them. Computers need to be updated
to the latest patch level as soon as possible after putting them
into use. They should be stored configured to enable no public services
upon boot, so that the patching step can be done before the system
is compromised.
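One way to enforce the "no public services upon boot" rule would be a small first-boot self-check like the sketch below, which refuses to declare a machine ready if anything is listening on a non-loopback address. It assumes a Linux host with the ss utility available; the details would need adapting to whatever systems end up in the cache.

    # Sketch: refuse to declare a cached computer ready if any service is
    # listening on a non-loopback address before it has been patched.
    # Assumes a Linux host with the `ss` utility available.
    import subprocess
    import sys

    def public_listeners():
        out = subprocess.run(["ss", "-H", "-tuln"],
                             capture_output=True, text=True, check=True).stdout
        bad = []
        for line in out.splitlines():
            fields = line.split()
            if len(fields) < 5:
                continue
            local = fields[4]                  # e.g. 0.0.0.0:22 or 127.0.0.1:631
            addr = local.rsplit(":", 1)[0]
            if addr not in ("127.0.0.1", "[::1]", "::1"):
                bad.append(local)
        return bad

    if __name__ == "__main__":
        listeners = public_listeners()
        if listeners:
            print("Public services are listening before patching:", listeners)
            sys.exit(1)
        print("No public listeners; safe to connect and patch.")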
Aleks Clark built one (an equipment and volunteer tracking system), but
it was not successfully adopted, by which I mean it fell into disuse
as soon as Aleks stopped running it.
An effective system would need disconnected operation: when you
are building the Internet, an Internet-based system doesn't help
you any. It also needs to be able to recover from the loss of flaky
donated hardware, and to move easily when the volunteer who has it
on his laptop is ready to go home.
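A sketch of what disconnected operation could look like: keep the whole tracker in a single SQLite file that lives on whatever laptop is handy, so it needs no network and can be handed off (or copied) when that volunteer goes home. The schema here is invented for the example.

    # Sketch: an equipment/volunteer tracker that works with no network at all.
    # Everything lives in one SQLite file, so a departing volunteer can hand
    # the file to whoever takes over.  The schema is a made-up example.
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS equipment (
        serial   TEXT PRIMARY KEY,
        kind     TEXT NOT NULL,       -- e.g. 'access point', 'customer router'
        location TEXT,                -- free text: site name, 'cache', 'lost'
        updated  TEXT DEFAULT CURRENT_TIMESTAMP
    );
    """

    def open_db(path="tracker.db"):
        db = sqlite3.connect(path)
        db.executescript(SCHEMA)
        return db

    def record(db, serial, kind, location):
        db.execute("INSERT OR REPLACE INTO equipment (serial, kind, location) "
                   "VALUES (?, ?, ?)", (serial, kind, location))
        db.commit()

    if __name__ == "__main__":
        db = open_db()
        record(db, "WRT-00123", "customer router", "Waveland fire station")
        for row in db.execute("SELECT serial, kind, location FROM equipment"):
            print(row)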
Home-networking class routers are cheap and easy to build into
complex systems, but they have incredibly bad software. They leak
resources, they mysteriously stop doing things they were doing
just fine yesterday, etc. They are essentially useless in a network
unless they can be automatically reset on a regular schedule.
Someone needs to make a smart power brick that power cycles the
router every 12 to 24 hours.
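Until such a brick exists, the behavior can be approximated in software. The sketch below pings the router and power cycles it when it stops answering, or when a maximum uptime is reached; toggle_outlet() is a placeholder, since the real control would depend on whatever relay or switched outlet is used, and the address and intervals are just examples.

    # Sketch: software stand-in for a watchdog power brick.  Ping the router
    # periodically; power cycle it when it stops answering or after a maximum
    # interval.  toggle_outlet() is a placeholder for the real relay control.
    import subprocess
    import time

    ROUTER_IP = "192.168.1.1"     # hypothetical customer router address
    CHECK_EVERY = 60              # seconds between health checks
    MAX_UPTIME = 12 * 3600        # force a power cycle every 12 hours

    def router_answers(ip):
        return subprocess.run(["ping", "-c", "3", "-W", "2", ip],
                              stdout=subprocess.DEVNULL).returncode == 0

    def toggle_outlet():
        # Placeholder: cut power, wait, restore power via the actual hardware.
        print("power cycling router")
        time.sleep(10)

    if __name__ == "__main__":
        last_cycle = time.time()
        while True:
            time.sleep(CHECK_EVERY)
            if not router_answers(ROUTER_IP) or time.time() - last_cycle > MAX_UPTIME:
                toggle_outlet()
                last_cycle = time.time()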
Build a custom image for Linksys devices based on the OpenWRT toolset.
This would make customer routers more manageable: it could enable
auto-configuration systems, and it could allow ping on the WAN interface,
which would improve network manageability.
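One way the auto-configuration piece could work is for the custom image to fetch its settings at boot from a central server, keyed on the router's MAC address. The sketch below shows only the server side, with an invented URL layout and settings format; the client side would be a few lines of shell in the image's boot scripts.

    # Sketch: server side of a possible auto-configuration scheme.  A custom
    # OpenWRT image would fetch /config/<MAC> at boot and apply the settings.
    # The URL layout, settings format, and addresses are invented examples.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Pre-assigned per-router settings, keyed by MAC address.
    CONFIGS = {
        "00:16:b6:aa:bb:cc": "lan_ip=10.10.2.5\nwan_ping=allow\n",
        "00:16:b6:dd:ee:ff": "lan_ip=10.10.2.6\nwan_ping=allow\n",
    }

    class ConfigHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            mac = self.path.rsplit("/", 1)[-1].lower()
            body = CONFIGS.get(mac)
            if body is None:
                self.send_error(404, "unknown router")
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body.encode())

    if __name__ == "__main__":
        HTTPServer(("", 8080), ConfigHandler).serve_forever()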
We found that time and again we were offered the use of bandwidth
(satellite, DSL, and the T3 (later T1) in Gulfport). We were not in
a position to make quick use of these offers. By engineering the system
ahead of time to expect those offers, we could be ready to accept them.
One challenge is that offers of IP uplink usually come with
strings attached; it's not really an offer of IP transit, but an offer to
plug in to their network with your laptop. That means to take advantage
of the connection, you'll need to use DHCP to get yourself an
IP address, then somehow live with the fact that you are
NAT'ed, and maybe even firewalled such that only HTTP works (and
maybe only via an HTTP proxy).
Instead of envisioning our network as a distribution network
downstream of a single NAT box, envision the network as
a set of one or more distribution networks connecting in to a
remote "stable NAT environment" via VPNs. Every time someone offers
us some bandwidth, we'll bring a PC over, load a Linux LiveCD
with our software on it, and boot it. The PC has two interfaces,
an inside and an outside. The outside interface DHCP's for an address
and then opens a TCP channel to the VPN server in the stable
NAT environment, which is hosted in a managed datacenter far
outside the disaster area. It measures the quality of the link
back to the NAT server (if it can make the link at all).
The inside interface runs VRRP
and asks around on the network segment it is plugged into if there is
anyone out there that can provide a better link to the Internet
than this server can. If so, then it waits as a standby.
If not, then it takes control of the router address, and all
the customer routers out there that had been failing to reach
their default router can now talk to the Internet via this
new connection.
Later, when we get the long haul link up to a T1, the lower
latency on that connection gives it a higher quality link measurement
and it advertises a better link. The box that last had the link
goes back to standby, and the high quality link is now the
master link outbound.
Because NAT for the entire network is happening in the stable
NAT environment, beyond the VPN connections, the customers never
even see their connections get broken. The other nice thing about
implementing NAT in the stable part of the network is that it will
be ready and pre-tested before the deployment. People on the ground
will only need to install pre-configured devices from the cache, and
run uplink sharing boxes at the edges of the net.
No one has attempted to build a system like this yet. We don't know
if it would work, and if so, how much standard software would
be involved, and how much custom (and thus buggy, difficult to
maintain) software would be required.
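To give a feel for how small the custom part might be, here is a sketch of just the uplink-quality measurement and the decision about whether this box should hold the shared router address. It assumes the VPN and the actual VRRP advertisement are handled by existing software (for example OpenVPN and keepalived); the hostname and latency thresholds are invented.

    # Sketch of the uplink-quality decision only: measure latency to the stable
    # NAT environment over this box's uplink and derive a VRRP-style priority.
    # The VPN and the VRRP advertisement itself are assumed to be handled by
    # existing software; the hostname and thresholds are invented examples.
    import statistics
    import subprocess

    NAT_SERVER = "vpn.example.org"   # hypothetical stable NAT environment host

    def measure_rtt_ms(host, samples=5):
        """Average ping round-trip time to the NAT server, or None if unreachable."""
        result = subprocess.run(["ping", "-c", str(samples), host],
                                capture_output=True, text=True)
        times = [float(line.split("time=")[1].split()[0])
                 for line in result.stdout.splitlines() if "time=" in line]
        return statistics.mean(times) if times else None

    def vrrp_priority(rtt_ms):
        """Lower latency earns a higher priority; unreachable means stand down."""
        if rtt_ms is None:
            return 0       # cannot reach the NAT server; never become master
        if rtt_ms < 100:   # e.g. a terrestrial T1
            return 200
        if rtt_ms < 700:   # e.g. a satellite uplink
            return 100
        return 50

    if __name__ == "__main__":
        rtt = measure_rtt_ms(NAT_SERVER)
        print("measured rtt:", rtt, "ms -> priority", vrrp_priority(rtt))

The box advertising the highest priority would take over the shared router address, which matches the failover behavior described above.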
As icing on the cake, it would be neat to build bandwidth throttling
into this proposed box. So, when we are offered bandwidth
for our personal administrative use, we can say, "How about if
we share it with the whole network?" They will undoubtedly say, "No,
just your laptop," and then we can say, "How about if we promise it will
never use more than 20% of your bandwidth?" That would be a very valuable
tool to have during negotiations. And as decorations on the icing
on the cake, you could have the bandwidth throttling controlled by
cron, so that after work hours, we get 90% of their bandwidth.
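The throttling could be as simple as swapping the rate cap from cron. The sketch below uses the Linux tc token bucket filter and assumes, purely for the example, a 1.5 Mbit/s donor line and an outside interface named eth0, so the 20% and 90% figures become roughly 300 and 1350 kbit/s.

    # Sketch: cron-driven bandwidth cap for a shared uplink, using the Linux
    # `tc` token bucket filter on the outside interface.  The interface name
    # and the donor's 1.5 Mbit/s line speed are assumptions for the example.
    import subprocess
    import sys

    IFACE = "eth0"         # outside interface on the uplink-sharing box
    LINE_KBIT = 1500       # assumed donor line speed: 1.5 Mbit/s

    def set_cap(fraction):
        rate = int(LINE_KBIT * fraction)
        subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root", "tbf",
                        "rate", f"{rate}kbit", "burst", "32kbit", "latency", "400ms"],
                       check=True)
        print(f"capped {IFACE} at {rate} kbit/s")

    if __name__ == "__main__":
        # From cron: run with "workday" in the morning and "offhours" in the evening.
        mode = sys.argv[1] if len(sys.argv) > 1 else "workday"
        set_cap(0.2 if mode == "workday" else 0.9)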
History of the Project
Genesis
Hancock County project (and my role)
Hancock County after the Hand Off
The future
Achievements
Customers
How the Network was Used
Timing affects usage patterns
Relief worker usage
Storm-specific uses
Internet Lab Use
Telephone use
Open Network, Unknown Users and Uses
Network Design
Initial Design
The second design
Lessons learned
Applicability of our approach
Applicability of this technology
Planning
Information Management
Getting Intel
What Data?
Sharing Intel
A proposal
Uplinks and Timing
Cooperation with Other Organizations
County Government
Incident Command System
Private volunteer organizations
Volunteer SAR teams, visiting Fire Fighters
Electric Companies
Power structure inside the team
Things we should have had
Naming difficulties
Telco Resiliency (or lack thereof)
Electrical Power
Tools and Ladders
Physical Installation Issues
Public Relations
Operating the Network
Customer Service
Network Operations
Back Office Needs
Donated Equipment Woes
Used Computers
Networking hardware
A Vision for Success
Future Work
Caches
Build/find an equipment and volunteer tracking system
Watchdog for home-networking routers
OpenWRT based image
Bandwidth sharing box (and auto uplink selection)
Versions of this document
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.