Marine Data Literacy 2.0

Providing instruction for managing, converting, analyzing and displaying oceanographic station data, marine meteorological data, GIS-compatible marine and coastal data or model simulations, and mapped remote sensing imagery





Home > 1. PC Basics > 1.12 Random Subsample

1.12 Randomize and/or Subsample Table Data in Excel

  • Exercise Title:  Randomize and/or Subsample Table Data in Excel

  • Abstract:  Very straightforward method to take a data table (for example an XYZ export from a grid) and cut it down to any desired fraction of the original points, while at the same time randomizing the record order.

  • Preliminary Reading (in OceanTeacher, unless otherwise indicated):

    • N/A

  • Required Software:

    • Excel - Or any other spreadsheet with =RAND function

  • Other Resources: 

    • - Typical tab-separated data in X, Y, Z format

  • Author:  Murray Brown

  • Version:  3-28-2014

1.  Download the above zipfile to DATA > OCEAN >  and unzip it there.

NOTE:  This data table is tab-separated.  For you own data, if they are separated by commas or spaces, it would be best to convert them right now to tab-separated, before going any further.  Use a good ASCII editor.

1.  You don't need to make this figure, but it is the contents of the XYZ table file above, imported into Saga and converted to a point shape.  The Z value is the temperature at 400 m.

These data have not been gridded; the points are so dense that it appears to be gridded.


2.  This is a view of the data table in Excel, showing the data values.  The points are extremely close and obviously in perfect rows (columns also!).
3.  Add a new column with the name RAND_INDX.  In the first cell place the formula =RAND().  Hit enter to see the first random index value.
4.  Double-click on the tiny square in the lower-right corner of the first value cell in the RAND_INDX column, and random numbers will appear in all the cells below it.

You cannot sort on this field, because every time you make any change on the sheet, a completely new set of random numbers appears.

5.  To sort on the random numbers, we must "freeze" them in another new column and use that column for sorting.

Block the RAND_INDX, and select COPY.

6.  Right-click on the next empty column, and select PASTE OPTIONS > PASTE VALUE (the second icon, with the little numbers).
7.  This freezes the random index values into the fifth column.  The fourth column probably has already changed again, but you don't need it now.
8.  Change the name of the 5th column to STATIC_INDX to indicate it will not change.
9.  Block the static column and select SORT & FILTER > SORT SMALLEST TO LARGEST.

Excel will ask what you see here; answer EXPAND THE SELECTION and click SORT.

10.  Now the whole table is sorted by the STATIC_INDX values, as you see here.
11.  If you want to see values in standard format, block the column and select FORMAT CELLS > NUMBER to see this window.  Select 9 decimal places, and click OK.
12.  Now you can see the index values better at the lower end of the list.
13.  Scroll to the bottom of the table to see the largest values.  Note that there are 84,189 points total.
14.  Now, if you want to subsample the data, you can do that by simple arithmetic counting of the rows, because they are already random.

If, for example, you want 10% of the points, then:

  • Block the lines down to line 8419 (10% of 84189)
  • Ignore the 4th and 5th columns
  • Block the desired columns
  • Select COPY
15.  Saga cannot read Excel files, so open any good ASCII editor, and use PASTE to load the 10% section from above.

Here you see the data in Context.

16.  You should SAVE the above table, with an appropriate filename in a logical folder.
17.  Just for your interest, here is the product table, converted to points in Saga and displayed on exactly the same sort of map as Panel 1 above.  You can easily see the randomization and the subsetting.
18.  This exercise is not for common use, because we don't often want to reduce data when we have it.  But for teaching purposes it is useful to take regular grids and make them appear to be data.  If you are teaching at some point in the future, just remember where you saw this method.