Bellabeats SQL/Excel Capstone project

How can a wellness company play it smart

Google Data analytics professional certificate case study 2.

Bellabeats Case Study.

Bellabeats is a high-tech manufacturer of health focussed products for women. They are a small successful company but are looking to become a larger player in the global smart device market. As a junior data analyst, I have been tasked with analysing data from fitness devices and applying the findings to one of Bellabeat’s products with the aim of providing insights that the company can apply to its marketing strategy for that device.

1. Business Task summary

Bellabeats CEO Urška Sršen is looking at ways to increase growth and revenue. To do this she has tasked me with analysing data from non-bellabeat products to see how users are utilising their smart fitness devices. By identifying usage and trends for these products and applying the findings to a Bellabeat product a marketing campaign can then be devised. The company uses a variety of advertising strategies to market their products so this will need to be considered when applying the findings of this study.

The questions being asked therefore are:

How do non Bellabeat users utilise their fitness devices?
How does this relate/apply to Bellabeats products and consumers?
How can these findings be used to influence a future Bellabeats marketing strategy?

2. Description of data being used

The data made available for use in this analysis is a public dataset made available through Mobius. The Kaggle dataset contains information pertaining to 30 different Fitbit users that agreed to share their personal tracker data. The data tracked includes 18 different spreadsheets each logging a different activity or fitness aspect including, daily activity, calories, exercise intensity, steps, hourly calories, intensities, steps as well as minute data for the same datasets. There are also datasets tracking minutes slept and daily overall sleep. Interestingly the dataset contains 33 not 30 total participants. It is possible that several participants updated their devices and so new devices were logged or that an additional 3 participants were added to the study.

Using the ROCC framework I analysed the datasets looking at the reliability of the data, the originality of the data, how comprehensive it was, how current it was and how cited it was. The dataset’s reliability is questionable given the low participant count of 33 individuals. This could allow for bias as more active users may have been more likely to respond to the survey than less active users. This could result in a skew in the results that may lead to a flawed marketing strategy. Additionally, it is not stated if the data was taken before or after the survey was completed. If the participants knew that their data was going to be used in a survey it may have led to them altering their behavioural habits again resulting in a skewed dataset. Finally in terms of reliability the data was taken from a variety of Fitbit devices and not from a single model. This may have a direct impact on the data collected and the induvial data may be more or less accurate based on the device used.

The data is however original. IT has been taken directly from Fitbit devices used. The dataset represents 30 individual’s (although 33 individual user IDs are present in the dataset) Fitbit data and each variable measured was taken directly from the device being followed. It terms of scope and coverage the dataset covers many of the commonly tracked variables that would be expected in data of this kind and should allow for a comprehensive analysis that would allow Bellabeats to draw conclusions about fitness tracker uses. As mentioned previously though it is a small sample size that should be added to, if possible, with more datasets.

A further drawback of this dataset is that it is from 2016 meaning that it could be out of data. Both trends in use and activities tracked may have significantly changed as Fitbits and other smart fitness devices have become more sophisticated and accurate in the time since this data was collected. It would be prudent to collect a more recent dataset to add to this one which would help to alleviate any biases and misinterpretations made due to the size and age of the data.

Since being updated the dataset has been upvoted over 2000 times on Kaggle in 3 years. It has been used for multiple data analysis projects and is well cited. This gives me confidence that the data will provide good base from which to analyse fitness habits and trends in people using smart fitness devices.

3. Documentation of any cleaning or manipulation of data

The first task was to ensure that the data was cleaned and ready for analysis. The dataset consists of 18 different CSV files. However not all of these are going to be useful in the analysis. As the task is to identify trends and patterns of behaviour in smart fitness device users it would be advisable to use a top-down approach in the data analysis process. Starting with daily data and moving onto hourly data. The dataset also contains data down to the minute and seconds of the day however these datasets will be excluded from this analysis as they will not be useful in identifying broad trends in use of the smart devices. Additionally, the daily activities spreadsheet contains all the data from the other daily CSV files and therefore they are not required in the analysis. They can however be used to confirm that the correct number of entries are present in the daily activities CSV file.

Firstly, I wanted to explore the daily datasets to see how users use their devices on a daily basis. The dailyActiviy_merged.csv was uploaded to bigQuery and the following SQL quires ran to determine how any distinct Id’s there are in this dataset and if there are any duplicates.

The number of distinct Id values in the daily_activities.csv

Output:

To determine the number of duplicates in the data

Output:

This shows that there are 33 distinct users included in the dataset 3 more than was expected and that there are no duplicate rows included in this file.

As the data was collected over the period of a month the dataset was examined to ensure that only values set within the timeframe of the study were present.

Table 1: Export form SQL showing dates in ascending order present in dailyActivities_merged.csv

As can be seen in table 1 there are 31 dates present in the dataset ranging from the 12APR2016 to the 12May2016. This means that the data is all collected from the same period for all users with outliers present.

With the date range and Id count confirmed the data was examined to ensure that there were no null values in any of the columns. The following piece of code was used to perform this analysis with the column name altered for each run.

Output: For all columns was as shown below.

This confirmed that all columns were complete containing no null values.

This dataset contains data for all 33 users on each of the 31 dates recorded. Whilst this woll provide a starting point for the analysis of user’s daily habits further information is required. Mainly there is no weekday with which to gain an insight into user’s weekly habits. Additionally averaging the data for each of the user’s will help the analysis for each of the datapoints measured. As such it two new tables were created to reflect the needs of the analysis.

Firstly, a weekday column was added to the data. The following code was run in order to create the table dailyactivities_weekday.

Secondly, all of the values for each column in the original dataset was averaged creating a new table with the average values for each user.

The other daily data present in the dataset was the dailySleepday_merged.csv. This contains records of the number of sleeps each user had, for how long and the time spent in bed. As with the dailyactivity dataset this table was analysed for the distinct number of users as well as analysed to see if there was any duplicates present in the data.

As can be seen from the results above there are only 24 user Ids present in this dataset. Additionally, there are three duplicates present in the dataset that need to be removed prior to any analysis being carried out. This was done using the following code to create a new table with the duplicates removed.

To check that this had worked the new table was checked for duplicates as before with the following output. The table was also inspected in BigQuery to confirm that the new table had three less rows than the previous version (410 rows compared to 413).

The dataset was then checked to ensure that there were no null values in the table. This was done as before with the daily activity dataset with all iterations of the code returning no data meaning that no null values are present in the dataset.

In order to get the most out of this dataset it was decided that it should be joined to the dailyactivity_weekday table with only the dates present in the sleepday table being used in the new table. In order to do this the SleepDay column was changed to match the ActivityDate column name in the daily activity table using the following code.

The two tables were then joined using the following code, with the data ordered by Id and ActivityDate

Data cleaning for hourly datasets.

Following the cleaning and validation of the daily datasets the hourly datasets were uploaded to BigQuery for similar analysis. The first part of the cleaning of the data was however carried out in Excel. The Activitydate/hour column in the hourly datasets contained two different formats for the datetime, this had to be changed to all be consistent prior to uploading the dataset. The first step carried out was to change all datetimes into the 24hr format. This was done using find and replace all for each PM hour in the incorrect format and for the 12AM data. Once this was done the data was split using the text to columns function splitting the cells into date, time and AM/PM columns. The AM/PM column was removed from the data as it was not required with the time now in 24hr format. Finally the ActivityHour column was renamed ActivityDate to bring it in line with the other spreadsheets and the new Hour column was named Hour. Both were formatted in date and time respectively prior to being uploaded. This was done for the HourlyStep, HourlyIntensities and HourlyCalories spreadsheets.

As with the daily datasets the 3 tables were checked for to ensure the correct number of Id’s present and that there were no duplicates in the datasets. The code used is shown below.

The output for all three functions was the same with 33 Id’s present in each dataset, the same number as was present in the daily activities dataset. In order to check if there

This query returned no results meaning that there is a maximum number of 24 results for each activity day for each Id. Next the data was checked to see if any data was missing from the dataset by changing the above code to <24. The results are shown in table 2 as a combined table for all three datasets.

Table 2: Combined table showing missing values in HourlyCalories, HourlyIntensities and HourlySteps

All three dataset are missing the same datapoints for the same Id’s and dates. This was further confirmed by checking that there were no null values present in any of the datasets. Using variations of the code below for each dataset (code shown is for Hourly intensities table).

For all iterations of this code no values were returned meaning that there are no null values or duplicates present in these datasets although they are missing some values. However as the values missing are all the same across the datasets then it will be fine to continue the analysis and join all of these tables together based on Id and ActivityDate ready for analysis to begin.

Finally, the weekday for each date was extracted from the ActivityDate column, using the following code.

With this done the data is now in a format that can be used to perform analysis to identify trends and patterns in the use of smart fitness devices which can then be used to inform future business practices for Bellabeats.

4.1 Analysing Key metrics in the Weekday dataset.

Firstly, the average daily activities for each individual user were analysed to see if there were any differences in use between different users. In order to do this box and whisker plots were created to identify any outliers in the data set. Upon generation of the plots, it was discovered that although there is calorie data for some users there are no steps or distance values. Upon closer inspection for many of the days with a value of 0 for steps and distance the calories burned values were the same despite being on different days. As such it was decided that any data with a value of 0 for steps and distance would be removed from the analysis. Additionally, User 1624580081 had one day where they walked over 36000 steps (marked with red circle figure 1(B)), as this is a clear anomaly in this dataset it was decided that this also would be removed from future analysis.

Figure 1: Box and whisker plots for (A) Calories burned, (B) Total Distance and (C) Total steps by User ID.

The data in figure 1 shows that there is a significant variation between users in terms of the number of calories they burn, total steps and total distance. However, when comparing for each individual user the number of calories burned directly correlates to the total steps and distance. This is perhaps unsurprising as most fitness devices user steps and distance covered to calculate calories burned.

The next set of data that was examined was the total average calories, Total steps and Total distances for all users. The analysis for this is shown in figure 2. Pivot tables were set up in Excel using the filtered data from figure 1 thus removing any values that may skew the data.

Figure 2: Average Calories burned (A), Steps (B) and Distance (C) per user.

As can be seen in figure 2 the number of steps a person records always directly correlates to the distance travelled. However, the number of calories a user burns does not always correspond to the number of steps/distances travelled. It is possible therefore that some users are using their device to track additional activities

To investigate this the average total distance was compared to the average tracker distance for each user. As can be seen in figure 3, there is no difference between these two sets of values. For every user the average distance is the same as the average tracker distance.

Figure 3: average distance compared to the average tracker distance per user

To see if certain users were also logging activities that could account for the differences observed in figure 2 the average distance was compared to the average logged activity distance for each user. This can be seen in figure 4. However, only 3 users had logged activity distance data within the dataset and of these three users only user 8378563200 displays a significant difference when comparing the graphs shown in figure2. Therefore, the differences observed in figure 2 may have to be down to the variation caused as a result of users utilizing different smart devices.

Figure 4: Average distance compared to the average logged activity distance per user.

The final piece of analysis in this section was to compare the differences in terms of movement classification per user. The dataset tracks four different classes of movement, these being sedentary, lightly, moderately/fairly, and very active. To see how these compare the average time spent in each class for all users combined was plotted as shown in figure 5. Over 81% of the time monitored was spent in the sedentary class whilst the fairly active class had the lowest average time spent in it. Whilst not shown here this was further investigated to see if this graphic was true for all users. The was the case with the lowest time spent in either the fairly/moderately or very active for all users and the largest average time spent in the sedentary class.

Figure 5: Average time spent for all users in different movement classes.

4.2 How do patterns of use change depending on the day of the week.

The next part of the analysis focused on how key metrics and user behaviour changed over different days of the week. This was started by looking at the average number of calories burned per weekday. As can be seen in figure 6 the day with the highest average calories burned is Tuesday with Saturday being the next highest by a small margin (2440 and 2428). The day with lowest average calories burned was Thursday. When looking at the same data for steps, the same two days Tuesday and Saturday still have the highest average step count however it is Sunday that has the lowest average step count. The data suggests that users are burning more calories early in the week and on a Saturday rather than later in the week or a Sunday.

The next part of this analysis was to compare how active users were by Weekday. Figure seven shows this data. As shown in figure 5 most users spend the majority of time either sedentary or lightly active. However, this data does not correlate well with any other data presented. Thursday for instance has the lowest average time spent in sedentary activity level whilst also having the lowest calories burned and step counts. This would be expected to be the other way around. The graphs for moderately active and very active minutes do however correspond to the other data shown here suggesting that these are the days when people are more likely to exercise (Monday, Tuesday and Satrurday). This should be taken with care however as the differences between the weekdays can be explained with the variation between users. More data is required to confirm the findings presented here.

Figure 6: Average calories burned per weekday (A), Average step count per weekday (B).

Figure 7: Different activity levels by weekday. Sedentary (Top Left), Lightly Active (Top Right), Moderately active (bottom left) very active (Bottom right).

4.3 Does the amount of sleep correlate to activity levels.

For this section of the analysis, it was speculated that sleep may correlate to activity levels and that users may track their sleep in order to boost performance. The sleep_joined dataset was exported from BigQuery for that analysis to be carried out. The data presented therefore represents a smaller sample size than the dataset above. Firstly, The average time asleep was compared to the total time in bed to see if there were any differences observed within this data set. As can be seen in figure 8, except for 1 user (184430081) there was very little difference between the time a user was in bed compared to the amount of sleep they averaged.

Figure 8: Average minutes in bed vs the average minutes asleep per user.

Next the average time asleep per weekday was investigated to see if the days with more sleep correlated to the days that people exercised. Figure 9 shows that Sunday had the highest average minutes of sleep with Thursday having the lowest. This does not correlate with the data shown in figures 6 and 7 which showed that both Sunday and Wednesday had some of the lowest activity levels across different weekdays.

Figure 9: Average time asleep per weekday.

The final investigation that was carried out on this dataset was to compare the average total minutes sleep with the varying levels of activity present within this dataset. Figure 10 shows this with Very active minutes (blue line) and fairly/moderately actives (green line) plotted against the average minutes asleep per weekday. There appears to be no correlation between minutes asleep and activity levels with high levels of activity being found on days with the lowest average minutes asleep.

Figure 10: Average minutes asleep compared to the average time spent very active (blue line) and the average time spent fairly active (green line)

4.4 Which hours have the highest average calories burned and step totals.

The final analysis that was carried out for this project was to investigate during which hours people were the most active. As days of the week could alter the trends traditional workdays (Mon-Fri) were plotted separately to weekends (Sat-Sun). On workdays the highest average calories burned, and steps counted occur from 4pm to 8pm with another spike at 12pm. This suggest that people are exercising following work. This changes on a weekend with the highest calories burned being observed between 9am-3pm with another spike seen between 5pm-7pm. From a logical standpoint this makes sense as people are not at work and will do more things in the middle of the day on weekends. Unfortunately, there is no data in this dataset breaking down the activity by intensities only an average intensity column. This is also shown in figure 11 and directly correlates to the average calories burned and average total steps. It would have a been goo analysis to have been able to investigate this further as seen above to see if user became very active at certain times of the day or not however without the data it is impossible to determine this.

Figure 11: Average calories per hour workday (top left), on a weekend (top right), average step count per workday (middle left), on a weekend (middle right), Average intensity per workday (bottom left), on a weekend (bottom right).

5. Recommendations based on analysis

There are several recommendations that can be made based on the data presented here. Firstly, Bellabeats should highlight any specific features that their devices have for enhanced step counting/ calorie measurements in their devices as most users use their devices to track these metrics. Secondly Bellabeats should focus their advertising on Saturdays or early in the week as this is when people are most active. If they are trying to be active on these days, then they are more likely to respond to an advertisement for a new fitness device on these days. They should also focus on evenings during the week and midday adverts on the weekend to reach their desired audience. Finally, Bellabeats could use the findings above to design apps that could help prompt users to exercise based in their habits. If the prompt is sent at the correct time based on the data above it could encourage more use from individual users and promote returning customers to the brand.

Whilst these recommendations are based on the data available it would be prudent to remember that the data available was very limited and should be expanded upon if possible. Unfortunately, upon searching it was not possible to identify a suitable dataset that could be used to expand on this therefore the available data was analysed as is. If possible a further recommendation for Bellabeats would be to attempt to expand this study to further identify more trends among a greater sample size and different demographics as this could also provide key insights into how different users utilise their smart fitness devices.