Here's a data science exercise with some very raw data. Geographical data of the United States (adapted from Turbo Prolog's circa-1990 "GeoBase" code) is in a 700-line CSV file called geo.csv. Lines can have different numbers of columns, and the entries on a given line can be numbers or strings. And, unlike sample.dat and co2.csv from the other exercises, this file is too big to be processed manually (i.e. "by eye" or "by hand").
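As a starting point, here is a minimal Python sketch of one way to read a file whose lines have different numbers of columns and mix numbers with strings. The helper names coerce and read_ragged are made up for this illustration, and the sample rows are invented rather than taken from geo.csv.

```python
import csv
import io


def coerce(field):
    """Return the field as an int or float if it parses as one, else as a stripped string."""
    text = field.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def read_ragged(stream):
    """Read CSV rows of varying length, skipping blank lines and converting numeric fields."""
    return [[coerce(f) for f in row] for row in csv.reader(stream) if row]


# Invented rows for illustration only; the real file is geo.csv.
sample = io.StringIO("alpha,Foo,12,3.5\nbeta,Bar\n\ngamma,Baz,7,x,y\n")
rows = read_ragged(sample)
for row in rows:
    print(len(row), row)
```

With the real file, the same function could be called on a handle opened with open("geo.csv", newline=""). Note that float() also accepts strings such as "nan" and "inf", so a stricter check may be needed if such strings can appear as names.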
The clue to what's in a line is the string in the very first column, which can be state, city, river, border, highlow, mountain, road, or lake. If this identifier is:
state, then the following columns will be: name, abbreviation, capital, area, admission-rank, population, city1, city2, city3, city4
city, then the following columns will be: state-it's-in, abbreviation, name, population
river, then the following columns will be: name, length, states-the-river-runs-through