Final Project Proposal

GitHub Repo URL

Team Zebra Final Project Proposal

Team Members

Naman Arora, Somya Agarwal, Ruhi Patel, Nate Fairbank

Dataset

We selected the results from the 2018 Stack Overflow Developer Survey. This questionnaire had over 100,000 respondents who reported information ranging from their exercise habits to compensation levels. The strengths of this data are that it is broad and fairly standardized, and that some rudimentary cleaning was done for the public release. The downsides are that the results of a mass-solicited, self-reported survey may not be reliable and may suffer from substantial selection bias in terms of who saw the survey and was willing to fill it out. Additionally, 100,000 entries may not be enough to support complex machine learning, and the data may become unreliably sparse once filtered for specific feature values.

Data URL

Kaggle

The question

"What makes a programmer successful?" This question is intentionally broad, to allow the user to project their own perception and line of inquiry onto it, prompting them to explore the data. For example, the pragmatic (and financially motivated) reader might be interested in the answer to the sub-question "what programming language should I learn to make the most money?" Another, more socially-oriented user might wonder "do white men stay in the field of computer science longer than minorities or women?" A third viewer might be curious about which undergraduate majors generate the highest career satisfaction. All three users have, in a way, asked "what makes a programmer successful?", with different means of achieving success and different definitions of success itself.

Further comments

It seems that a user-driven "Interactive Visualization/Application Track" is best suited to addressing this question. Our application would briefly explain the data and pose the question, perhaps with some initial suggestions for exploration, drawing the user in and allowing them to explore the data along their own path. However, if we find a series of compelling relationships, a narrative track might allow us to better direct the user towards these key insights.

Next steps:

Sketching:

Each of us will make three initial sketches of questions we would like to try to answer. We should have the data schema in front of us while sketching, but not the actual data. We can then compare the 12 resulting sketches as a group, pick the best ones, and refine them.

Data exploration:

We will then use Tableau to do an initial exploration of the features we are interested in, looking for interesting relationships between features, notable feature values and summary statistics, and compelling visualizations.

Sketching round 2:

This round of sketching will focus on tying together the areas of exploration that we feel are most promising based on the data exploration. How will the user navigate through the data? What interactivity will occur between graphs?

This takes us through the initial design phase of the project. Once we have sketched individual visualizations, explored the data, and sketched a way to link the individual visualizations together, we will be better able to plan the actual development of the product.

Data Analysis

We are using DeepNote (Pandas), Tableau, and Trifacta to collaborate on data analysis.

The dataset is quite rich and in good shape. Because the data was collected through a survey, it contains some nulls, which we would consider dropping.
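Before dropping anything, it may help to see how many answers are missing per column. The following is a minimal sketch in plain Python, under several assumptions: that rows are dictionaries (for example from `csv.DictReader`), that unanswered questions appear as `NA` or an empty string, and that `ConvertedSalary` and `CareerSatisfaction` are the columns of interest. The helper names `missing_share` and `drop_missing` are hypothetical, and the sample rows are invented for illustration.

```python
MISSING = {None, "", "NA"}  # assumed encodings of an unanswered question


def missing_share(rows):
    """Return the fraction of missing answers for each column."""
    columns = rows[0].keys()
    return {c: sum(r.get(c) in MISSING for r in rows) / len(rows) for c in columns}


def drop_missing(rows, required):
    """Keep only the rows that answered every column listed in `required`."""
    return [r for r in rows if all(r.get(c) not in MISSING for c in required)]


# Toy rows invented for illustration, not real survey responses.
rows = [
    {"ConvertedSalary": "90000", "CareerSatisfaction": "Extremely satisfied"},
    {"ConvertedSalary": "NA", "CareerSatisfaction": "Moderately satisfied"},
    {"ConvertedSalary": "55000", "CareerSatisfaction": ""},
    {"ConvertedSalary": "72000", "CareerSatisfaction": "Slightly dissatisfied"},
]

print(missing_share(rows))
print(len(drop_missing(rows, ["ConvertedSalary", "CareerSatisfaction"])))
```

Dropping rows only on the columns a given chart needs, rather than on all 129 features at once, should keep more respondents available for each visualization.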


We have a large amount of data, with 129 features available. We can drill down into each and choose what makes the most sense going forward. We have explored a subset of these in our sketches and initial EDA process.


Sketches


The user is first asked "Do you want to be a rich coder or a happy coder?" Suppose they say "rich"; if they say "happy", the roles of career satisfaction and money in the description below are reversed.
Income is on the y-axis, and the x-axis is a category that the user selects from a list. The chart starts out as a histogram but could use any category from the list in the sketch (or something else). The user can, for instance, ask "which programming language makes the most money?" They can then click a bar on the chart, and a separate 1-D chart shows the career satisfaction of the people in that category. For instance, you might see that Java programmers make more money than Python programmers, but clicking their bars reveals that Python programmers are happier.
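The data behind this interaction could be prepared roughly as follows. This is only a sketch under assumptions: the column names `LanguageWorkedWith`, `ConvertedSalary`, and `CareerSatisfaction`, the semicolon-separated multi-select format, and the `NA` marker for missing answers are assumed, and the helper names are hypothetical. Using the median rather than the mean is our choice here, since self-reported salaries tend to contain extreme values.

```python
from collections import Counter, defaultdict
from statistics import median

MISSING = (None, "", "NA")  # assumed encodings of an unanswered question


def median_by_category(rows, category, value):
    """Median of `value` per level of `category` (the bar heights).

    Multi-select answers such as "Java;Python" count towards every level listed.
    """
    groups = defaultdict(list)
    for row in rows:
        if row.get(category) in MISSING or row.get(value) in MISSING:
            continue
        for level in row[category].split(";"):
            groups[level].append(float(row[value]))
    return {level: median(values) for level, values in groups.items()}


def satisfaction_counts(rows, category, level, satisfaction="CareerSatisfaction"):
    """Satisfaction answers of everyone in the clicked bar (the 1-D chart)."""
    counts = Counter()
    for row in rows:
        if row.get(category) in MISSING or row.get(satisfaction) in MISSING:
            continue
        if level in row[category].split(";"):
            counts[row[satisfaction]] += 1
    return counts
```

A click on the "Python" bar would then call `satisfaction_counts(rows, "LanguageWorkedWith", "Python")` to populate the second chart.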


This chart shows how participation in a coding community affects success. I had envisioned career satisfaction and career length being encoded on the vertical and horizontal axes, but other metrics of success, such as salary, could also be displayed (perhaps via a toggle).
Each person would be plotted as a point on the graph, with a color encoding indicating the count of community-oriented activities (the size encoding shown in the sketch would not be effective). Tooltips would show the specific activities that someone participated in. A useful interaction would be to let the user select a subset of the population by area (or by activity count) and then display extra information, such as the average salary of the selected population.
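The area selection could be backed by a small helper like the one below. It is a sketch only: the point fields `x`, `y`, and `salary`, and the function name `brush_average`, are hypothetical, and a rectangular selection is assumed.

```python
from statistics import fmean


def brush_average(points, x_range, y_range, field="salary"):
    """Average `field` over the points inside a rectangular selection.

    Returns (average, count); the average is None when nothing is selected.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    selected = [
        p for p in points
        if x_lo <= p["x"] <= x_hi and y_lo <= p["y"] <= y_hi
    ]
    average = fmean(p[field] for p in selected) if selected else None
    return average, len(selected)
```

Returning the count alongside the average lets the interface warn the user when a selection is too small to be meaningful.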

Programming languages used by those developers:-


Education level of these developers:-


The country these developers belong to:-

(Most satisfied developers reported are from the United States)


Demographics:

We plan to show the distribution of age, race, gender, and location on the x-axis, with the income distribution on the y-axis. We also plan to include a bar showing career satisfaction or job satisfaction, so the chart can be read as who makes more money and how satisfied they are with their jobs. The demographic attribute can be chosen from a dropdown.

Personal Habits:

Exploring columns like exercise, skipMeal, wakeTime, and Computer hours. The idea is to show whether habits such as not skipping meals or exercising are associated with earning more or being more satisfied.


What can you expect to earn? How do you rank?

The expected earnings of an individual can be derived from a regression model. We can also add explainability elements to the visualization that show how much each factor provided by the user has influenced the predicted salary.
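One simple way to get both a prediction and a per-factor explanation is an ordinary least squares model on mean-centred features, where each factor's contribution is its coefficient times its distance from the average respondent. The sketch below assumes numeric features and uses only the standard library; the function names `fit_linear` and `explain` are hypothetical, and the final model may well be something else (for example a library regressor with categorical encoding).

```python
from statistics import fmean


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(xs, ys):
    """Least squares fit; returns (baseline, feature means, coefficients)."""
    k = len(xs[0])
    means = [fmean(row[j] for row in xs) for j in range(k)]
    baseline = fmean(ys)  # predicted salary for the "average" respondent
    centred = [[row[j] - means[j] for j in range(k)] for row in xs]
    # Normal equations: (X^T X) coefs = X^T (y - baseline)
    xtx = [[sum(r[i] * r[j] for r in centred) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * (y - baseline) for r, y in zip(centred, ys)) for i in range(k)]
    return baseline, means, solve(xtx, xty)


def explain(model, features):
    """Return (prediction, contributions), one contribution per feature."""
    baseline, means, coefs = model
    contributions = [c * (x - m) for c, x, m in zip(coefs, features, means)]
    return baseline + sum(contributions), contributions
```

Because the prediction equals the baseline plus the sum of the contributions, the contributions can be drawn directly as a waterfall or stacked bar next to the predicted salary.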
