In the previous post, we took the data that we exported from Google Maps Timeline, processed it, and uploaded it to Google BigQuery for analysis. Today we are going to do an exploratory data analysis of that information.
The file above is an example of the `JSON` information that we get from Google.
I should mention that I’m not an expert in GPS information, so I’ve tried to do
some research on all of this. If you find something wrong, you can open a pull request and help
me fix it.
As you can see from the file above, we have a lot of rich metadata that we can use.
Value | Meaning |
---|---|
heading | The direction the device is traveling. |
activity.type | Here, activity could refer to multiple values. My guess is that Google is using some machine learning magic to infer what the user is potentially doing. There are many possible values. |
activity.confidence | Here, Google assigns a confidence score to the activity type, ranging from 0 (low) to 100 (high). |
activity.timestampMs | This is the timestamp in milliseconds for the recorded activity. |
verticalAccuracy | This could refer to the accuracy of the vertical location of the device. |
velocity | This could refer to the speed of the device at capture time. It’s probably inferred based on other data points. |
accuracy | Accuracy is Google’s estimate of how accurate the data is. An accuracy of less than 800 is high and more than 5000 is low. |
longitudeE7 | This is the longitude of the observation, stored as an integer equal to degrees multiplied by 1e7 (hence the E7 suffix). |
latitudeE7 | This is the latitude of the observation, stored as an integer equal to degrees multiplied by 1e7. |
altitude | This could refer to the altitude of the device. I’m assuming it’s measured from sea level. |
timestampMs | This is the timestamp in milliseconds that the observation was recorded. |
The main values that we’ll be working with are `timestampMs`, `longitudeE7`, and `latitudeE7`. These values are not in a great “human readable” format, but BigQuery can help us fix that!
In BigQuery, we can convert this `timestampMs` value `1486800415000` to `2017-02-11 08:06:55.000 UTC` using `MSEC_TO_TIMESTAMP()`. We can also easily convert `latitudeE7` and `longitudeE7` by dividing by `1e7`: `481265044/1e7` becomes `48.1265044` and `116593258/1e7` becomes `11.6593258`, giving us the coordinates `48.1265044, 11.6593258`, which is 48°07'35.4"N 11°39'33.6"E.
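If you’d like to sanity-check these conversions outside of BigQuery, here’s a minimal sketch in base `R` (only the sample values come from the snippet above):

```r
# Milliseconds since the Unix epoch -> human-readable UTC datetime
as.POSIXct(1486800415000 / 1000, origin = "1970-01-01", tz = "UTC")
#> [1] "2017-02-11 08:06:55 UTC"

# E7-encoded coordinates -> decimal degrees
481265044 / 1e7  # latitude:  48.1265044
116593258 / 1e7  # longitude: 11.6593258
```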
If you want to read more about latitude and longitude, check out Understanding Latitude and Longitude.
Also, looking at the activity for this observation in the example data above, Google is 75% confident that I’m in a vehicle going somewhere.
Exploratory Data Analysis
Now that we have looked at the data available in the `JSON` file, let’s write some `SQL`, pull all of the information into `R`, and take a look at what’s going on.
For this analysis, I’ll be analyzing the following:
Value | Type | Definition |
---|---|---|
latitude | FLOAT | The latitude of the observation |
longitude | FLOAT | The longitude of the observation |
date | TIMESTAMP | The date, converted from timestampMs |
accuracy | INTEGER | The accuracy of the observation |
accuracyLevel | STRING | The inferred accuracy level of the observation |
minuteDifference | FLOAT | The difference in minutes between the current and previous observation |
activityType | STRING | The type of activity of the observation |
activityConfidence | INTEGER | The confidence level of the activity |
latLong | STRING | The concatenated latitude and longitude values |
cityLatLong | STRING | The lat/long values used to guess the current city |
The table above is generated by running the Legacy SQL code below in BigQuery. There are a few ways you can handle this. In my `R` code, I’ve written the query to pull the results directly into a local data frame. You can also run the query and save the results to a temp table to help reduce calls and save yourself a little bit of money. It’s personal preference.
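As a rough sketch of the first approach, pulling the results straight into a data frame with the `bigrquery` package could look something like this. The project and table names here are hypothetical, the `accuracyLevel` thresholds come from the metadata table earlier in the post, and this is not necessarily the exact query behind the table above:

```r
library(bigrquery)

# Hypothetical project name -- substitute your own.
project <- "my-gcp-project"

# Legacy SQL sketch of a few of the derived fields described above.
sql <- "
SELECT
  latitudeE7 / 1e7 AS latitude,
  longitudeE7 / 1e7 AS longitude,
  MSEC_TO_TIMESTAMP(timestampMs) AS date,
  accuracy,
  CASE
    WHEN accuracy < 800 THEN 'high'
    WHEN accuracy > 5000 THEN 'low'
    ELSE 'medium'
  END AS accuracyLevel,
  (timestampMs - LAG(timestampMs, 1) OVER (ORDER BY timestampMs)) / 60000
    AS minuteDifference
FROM [my_dataset.locations]
"

# Pull the results directly into a local data frame.
df <- query_exec(sql, project = project, use_legacy_sql = TRUE,
                 max_pages = Inf)
```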
Overall
Let’s begin our analysis by looking at some simple questions:
- How many observations do we have? We have 1,770,882 recorded observations.
- What is the minimum recorded time? 2011-07-24 03:59:37 UTC
- What is the maximum recorded time? 2018-03-21 11:46:00 UTC
- What is the median time difference between observations? 60 seconds
- How many different coordinates were recorded? 418,890
Wow, so ~1.7 million observations between 2011 and 2018. That’s a lot, considering that observations are sent every 60 seconds! There are also 10,081 observations recorded only 26 seconds apart. That might be worth investigating more.
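If you want to poke at those gaps yourself, a quick `dplyr` sketch (assuming the query results live in the `df` data frame from above, with its `minuteDifference` column) would look like:

```r
library(dplyr)

# Tally how often each gap between consecutive observations occurs
df %>%
  count(minuteDifference, sort = TRUE)

# Median gap between observations, converted to seconds
median(df$minuteDifference, na.rm = TRUE) * 60
```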
Another thing that I’ve observed is that since observations depend on cell towers, satellites, and Wi-Fi, they can sometimes be inaccurate even when you’re standing still.
Take, for example, these two locations: `40.555653, -105.098351` and `40.555608, -105.098397`. Even though my phone may be sitting quietly in a locker, the distance recorded between observations can be about 20 feet. It’s hard to tell without visualizing it, so I’m not going to worry about that for now.
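That said, we can put a rough number on it. Here’s a small haversine sketch in `R` (the standard great-circle formula; only the two coordinates come from above) that computes the distance between those points:

```r
# Great-circle distance between two points via the haversine formula
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

d <- haversine_m(40.555653, -105.098351, 40.555608, -105.098397)
d            # ~6.3 meters
d * 3.28084  # ~21 feet
```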
Accuracy
How accurate are the observations? 95% of all observations are rated `high`, with fewer than 0.02% rated `low`.
accuracyLevel | n |
---|---|
high | 1696926 |
medium | 73637 |
low | 319 |
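Assuming the same `df` data frame, the counts (and the percentages quoted above) fall out of a quick `dplyr` summary:

```r
library(dplyr)

# Count observations per accuracy level and compute each level's share
df %>%
  count(accuracyLevel) %>%
  mutate(pct = round(100 * n / sum(n), 2))
```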
Activities
activityType | count | medianConfidence |
---|---|---|
still | 549051 | 100 |
tilting | 65220 | 100 |
unknown | 35531 | 75 |
in vehicle | 33277 | 85 |
on foot | 15554 | 80 |
exiting vehicle | 964 | 100 |
on bicycle | 866 | 75 |
As you can see, we have seven different types of activity, excluding `NA`’s. It would be interesting to try to figure out what’s going on with the `unknown` and `NA` values.
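For reference, a summary like the activity table above could be produced with a sketch along these lines (again assuming the `df` data frame):

```r
library(dplyr)

# Observation count and median confidence for each activity type
df %>%
  filter(!is.na(activityType)) %>%
  group_by(activityType) %>%
  summarise(
    count = n(),
    medianConfidence = median(activityConfidence, na.rm = TRUE)
  ) %>%
  arrange(desc(count))
```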
Let’s visualize the distribution by activity type. All of the observations are left-skewed, so we are going to visualize this using a log scale.
Here’s the `R` code if you would like to follow along: