
Reddit and the aspiring data scientist

For amateur data scientists, Reddit provides the opportunity to post about personal projects, chat with like-minded people and get free advice from a large community.

The cluster analysis below shows two years of aspiring data scientist Quan Tran's life, plotted geographically from GPS data points. It uses tiny dots against a white background, and in places the dots are so dense that they appear as branching lines.

The patterns show the spots Tran visited when he was in college in Houston a few years ago. They also veer into the predictive analytics realm, indicating which places he is most likely to visit in the future.

The multicolored visualization can almost be seen as art. Indeed, the image got a good deal of attention when Tran posted it to the Reddit subreddit DataIsBeautiful last month, earning, for a short while at least, a top spot in the 13 million-plus subscriber group.

For Tran, a developer at a small software company who is hoping to break into the machine learning and data science fields, the analysis wasn't just a fun or artsy project. It was a real chance to learn new skills and refine old ones, and to create another project to showcase in his machine learning portfolio.

By posting the work on Reddit, where he goes by the username NeedJobInBayArea, Tran also opened it up to free and immediate feedback from the community.

Quan Tran's cluster analysis of his Houston GPS data points

Time to cluster

The cluster analysis, a small unsupervised learning project, was Tran's first project using a raw and unlabeled GPS data set. The data was captured by Google Timeline, and he downloaded it using Google's Takeout feature. He then narrowed the scope of the data to limit it to his movements around Houston.
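The narrowing step can be illustrated with a minimal Python sketch. It is not Tran's code. It assumes the export is a JSON file with a `locations` list whose records carry `latitudeE7` and `longitudeE7` fields (degrees multiplied by 10^7), and the bounding box around Houston is a hypothetical set of values chosen only for illustration.

```python
import json

# Hypothetical bounding box roughly covering Houston (illustrative values only).
HOUSTON_BOX = {"lat_min": 29.5, "lat_max": 30.1, "lon_min": -95.8, "lon_max": -95.0}


def load_points(takeout_json):
    """Return (lat, lon) tuples from an assumed Takeout-style JSON string."""
    records = json.loads(takeout_json).get("locations", [])
    points = []
    for rec in records:
        # Skip records missing a coordinate rather than guessing a value.
        if "latitudeE7" not in rec or "longitudeE7" not in rec:
            continue
        # Assumed format: coordinates stored as degrees multiplied by 1e7.
        points.append((rec["latitudeE7"] / 1e7, rec["longitudeE7"] / 1e7))
    return points


def within_box(points, box):
    """Keep only the points that fall inside the bounding box."""
    return [
        (lat, lon)
        for lat, lon in points
        if box["lat_min"] <= lat <= box["lat_max"]
        and box["lon_min"] <= lon <= box["lon_max"]
    ]


# Made-up sample records standing in for a real export.
sample = json.dumps({"locations": [
    {"latitudeE7": 297604000, "longitudeE7": -953698000},   # inside the box
    {"latitudeE7": 377749000, "longitudeE7": -1224194000},  # far outside the box
    {"latitudeE7": 297200000},                              # incomplete record
]})

houston_points = within_box(load_points(sample), HOUSTON_BOX)
print(houston_points)
```

A real export would be read from a file instead of a sample string, and its field names may differ from the ones assumed here.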

"I focused mostly on [the] data preprocessing part. There are lots of data sets that are raw/unprocessed or even unlabeled, which makes it harder to extract any useful information," Tran said in a message on Reddit's email-like private messaging system.

Tran has ample company on the forum. The DataScience subreddit has more than 70,000 subscribers and Analytics has close to 20,000, in addition to the much larger DataIsBeautiful subreddit. Aspiring data scientists can post to ask for advice on personal projects. Such posts fill the subreddits, and the communities appear to be helpful and quick to respond.

Building a forecast

Kyle Peterson, a software engineer working for a self-service analytics company in Atlanta, is another of the aspiring data scientists on Reddit. Like Tran, Peterson is working on a data science project on the side and learning a number of new skills while doing so.

Called Farsight Forecast, his project is a predictive analytics tool that is meant to make it easier for non-computer scientists to access "data scientist-quality forecasting," he said in a message.

"I've been exposed to how several different organizations forecast things throughout my career, and there's always been a lot of room for improvement," Peterson said.

"The problem is that building better forecasts usually requires coding skills -- particularly Python or R," he said. "Not everyone is going to have those skills, so I began building Farsight Forecast to give users access to modern forecasting algorithms through an easy-to-use UI."

Meanwhile, Tran largely relied on Python to clean, cluster, plot and visualize the data, taking advantage of free Python libraries like scikit-learn, a machine learning library, and Matplotlib, a 2D plotting library.
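As a rough idea of what clustering GPS points with scikit-learn can look like, here is a minimal sketch. The article does not say which algorithm Tran used, so DBSCAN is an assumption made for illustration. The coordinates are made up, and the `eps` and `min_samples` values are arbitrary. Distances are computed on raw degrees, a simplification that ignores the curvature of the Earth.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up (latitude, longitude) points: two tight groups and one stray point.
coords = np.array([
    [29.7200, -95.3400], [29.7205, -95.3402], [29.7198, -95.3397],
    [29.7203, -95.3405], [29.7201, -95.3399],   # first group
    [29.7600, -95.3700], [29.7604, -95.3703], [29.7597, -95.3698],
    [29.7602, -95.3706], [29.7601, -95.3701],   # second group
    [29.9000, -95.1000],                        # isolated point
])

# eps is in degrees here (an illustrative choice, not a tuned value).
model = DBSCAN(eps=0.005, min_samples=3)
labels = model.fit_predict(coords)

# DBSCAN labels noise points as -1, so exclude that label from the count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels, n_clusters)
```

The resulting labels could then be passed to Matplotlib, for example as the `c` argument of `scatter`, to color each point by its cluster.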

Some advice

On Reddit, Tran has already received at least one particularly helpful piece of advice.

"There was a comment suggesting to try another algorithm, and it turned out to be the better one for my data set," Tran said. "I wish I could get more comments."

As for Peterson, he has been asking users in the SmallBusiness subreddit questions about how they forecast sales, costs and other numeric data for their small businesses. He has received several answers, and he is using them for his research.

While working on his project, which he hopes to someday commercialize, Peterson said he has been learning "a ton of new stuff," mainly in the marketing and sales domain.

On the technical side, however, he noted he has also learned to use Node.js, a JavaScript runtime.

New and improved skills

Peterson said working alone has presented some challenges.

"It's easy to take things for granted, like the provisioning of new servers or getting SSL [Secure Sockets Layer] set up, when you work on a team. But as a one-person outfit, I've got to figure everything out myself," Peterson said.

Tran, for his part, said he has also honed his skills, and he hopes the project will build up his portfolio and help further his career.

"To be frank, machine learning is an important field right now," he said.

His Reddit projects, with the exposure and advice that come with them, could help get him into that field.
