Lesson goal: Data Science: Messy geography of the U.S.


Here's a data science exercise with some very raw data. Geographical data of the United States (adapted from Turbo Prolog's circa-1990 "GeoBase" code) is in a 700-line CSV file called geo.csv. Lines can have different numbers of columns, and the entries on a given line can be numbers or strings. Unlike sample.dat and co2.csv from the other exercises, this file is too big to process manually (that is, "by eye" or "by hand").

The clue to what's in a line is the string in the very first column, which can be state, city, river, border, highlow, mountain, road, or lake. Each kind of line has the following columns, starting with that identifier:

  • state, name, abbreviation, capital, area, admission-rank, population, city1, city2, city3, city4
  • city, state-it's-in, abbreviation, name, population
  • river, name, length, states-the-river-runs-through
  • border, state, abbreviation, states-that-share-the-border
  • highlow, state, abbreviation, highest-point, height, lowest-point, height
  • mountain, state-it's-in, abbreviation, name, height
  • lake, name, area, states-the-lake-is-in
  • road, number, states-the-road-passes-through
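One way to act on the first-column identifier is a lookup table from identifier to column names. The sketch below is plain Lua and rests on several assumptions: split_fields is a hypothetical stand-in for the lesson's explode_str, the field names in fixed_columns are illustrative labels for the columns listed above, and the sample line is made up rather than copied from geo.csv.

```lua
-- Hypothetical stand-in for explode_str: split s on a literal separator.
function split_fields(sep, s)
  local fields = {}
  local start = 1
  while true do
    -- The fourth argument (true) makes find treat sep as plain text.
    local i, j = string.find(s, sep, start, true)
    if not i then
      fields[#fields + 1] = string.sub(s, start)
      break
    end
    fields[#fields + 1] = string.sub(s, start, i - 1)
    start = j + 1
  end
  return fields
end

-- Illustrative labels for the fixed columns that follow each identifier.
-- Any remaining columns (city1..city4, lists of states) land in "rest".
fixed_columns = {
  state    = {"name", "abbreviation", "capital", "area", "admission_rank", "population"},
  city     = {"state", "abbreviation", "name", "population"},
  river    = {"name", "length"},
  border   = {"state", "abbreviation"},
  highlow  = {"state", "abbreviation", "highest_point", "highest_height",
              "lowest_point", "lowest_height"},
  mountain = {"state", "abbreviation", "name", "height"},
  lake     = {"name", "area"},
  road     = {"number"},
}

-- Turn one line into a table keyed by column label.
-- Returns nil when the first column is not a known identifier.
function parse_line(line)
  local fields = split_fields(",", line)
  local names = fixed_columns[fields[1]]
  if not names then
    return nil
  end
  local record = { kind = fields[1], rest = {} }
  for i = 2, #fields do
    local label = names[i - 1]
    if label then
      record[label] = fields[i]          -- still a string at this point
    else
      record.rest[#record.rest + 1] = fields[i]
    end
  end
  return record
end

-- Made-up sample line, not taken from geo.csv.
local r = parse_line("river,example river,100,aa,bb")
print(r.kind, r.name, r.length, #r.rest)
```

Every value stays a string here, so a field such as r.length would still need tonumber() before any arithmetic or comparison.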

There are many questions one might wonder about this data. Here are some you might try:

  • What are the names and abbreviations of all states in the U.S.?
  • What states start with a 'C'?
  • What cities are in California?
  • What is the biggest city in the U.S.?
  • What is the longest river in the U.S.?
  • Which rivers are longer than 1,000 kilometers?
  • What is the name of the state with the lowest point in it?
  • Which states border Alabama?
  • Which rivers do not run through Texas?
Code that answers some of these is in the examples.
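As an illustration of how one of these questions might be approached, here is a minimal sketch for "What is the longest river?". It is plain Lua and not the lesson's own example code: split_fields is a hypothetical stand-in for explode_str, the sample lines are invented, and it assumes the river length sits in the third column as listed above.

```lua
-- Hypothetical stand-in for explode_str: split s on a literal separator.
function split_fields(sep, s)
  local fields = {}
  local start = 1
  while true do
    local i, j = string.find(s, sep, start, true)  -- true = plain text match
    if not i then
      fields[#fields + 1] = string.sub(s, start)
      break
    end
    fields[#fields + 1] = string.sub(s, start, i - 1)
    start = j + 1
  end
  return fields
end

-- Scan every line, keep only "river" lines, and remember the longest one.
-- Returns the river's name and its length as a number (nil if none found).
function longest_river(lines)
  local best_name, best_length = nil, nil
  for _, line in ipairs(lines) do
    local f = split_fields(",", line)
    if f[1] == "river" then
      local length = tonumber(f[3])   -- strings must be converted to compare
      if length and (best_length == nil or length > best_length) then
        best_name, best_length = f[2], length
      end
    end
  end
  return best_name, best_length
end

-- Invented sample lines, mixing kinds the way geo.csv does.
local sample = {
  "state,examplia,ex,capitol city,1000,1,50000,city a,city b",
  "river,short creek,120,ex",
  "river,long water,3400,ex,zz",
  "river,middle fork,900,zz",
}
print(longest_river(sample))
```

The same filter-then-compare loop would presumably carry over to the biggest-city question by checking for "city" and comparing the population column instead.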

Now you try. Write code to answer some of our proposed questions, or even to just "explore" the data.


Some of these will be helpful to you:
  • Confused about the data? If so, that's typical in data science! Take a look at it by reading it in and printing it to the screen.

  • Since not all fields of the lines in this file are numbers (they're a mix of numbers and strings), do not use explode to break them apart. Use explode_str to break them into an array of strings.

  • If you need to convert a string to a number, use the tonumber() function.

  • strings = explode_str("separator",string) -- explodes string into an array of strings.

  • nums = explode("separator",string) -- explodes string into an array of numbers.

  • If arr is an array, then #arr will tell you how many elements are in the array.

  • Arrays in Lua always start at 1 (not 0).

  • substr(s,start,len) -- returns the substring of string s, starting at character start, for len characters after that. The first character in a string is at position 1 (not 0).
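
To see several of these hints working together, here is a minimal sketch for "What states start with a 'C'?". It runs in plain Lua, so it uses stand-ins: split_fields is a hypothetical replacement for explode_str, and the standard string.sub(s, 1, 1) plays the role of substr(s, 1, 1). The sample lines are invented, and the comparison ignores case because the capitalization used in geo.csv is not stated here.

```lua
-- Hypothetical stand-in for explode_str: split s on a literal separator.
function split_fields(sep, s)
  local fields = {}
  local start = 1
  while true do
    local i, j = string.find(s, sep, start, true)  -- true = plain text match
    if not i then
      fields[#fields + 1] = string.sub(s, start)
      break
    end
    fields[#fields + 1] = string.sub(s, start, i - 1)
    start = j + 1
  end
  return fields
end

-- Collect the names of all "state" lines whose name begins with letter.
function states_starting_with(letter, lines)
  local names = {}
  for _, line in ipairs(lines) do
    local f = split_fields(",", line)
    -- Arrays start at 1: f[1] is the identifier, f[2] is the state name.
    -- #f guards against short or empty lines.
    if f[1] == "state" and #f >= 2 then
      -- First character of the name (position 1, length 1).
      local first = string.sub(f[2], 1, 1)
      if string.lower(first) == string.lower(letter) then
        names[#names + 1] = f[2]
      end
    end
  end
  return names
end

-- Invented sample lines, not taken from geo.csv.
local sample = {
  "state,cedarland,cd,cedar city,100,1,5000,cedar city",
  "state,pineland,pl,pine city,200,2,7000,pine city",
  "city,cedarland,cd,cove,10",
}
local found = states_starting_with("C", sample)
print(#found, found[1])
```

Note that the city line is skipped even though one of its columns starts with "c", because the check on f[1] comes first.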