31/03/2022  •   6 min read  

How Web Scraping Is Used to Extract Indeed Job Data and Predict Salaries

The first step is to scrape data from Indeed.com job posts, such as job title, employer details, job profile, location, and salary details. This is done by scraping a range of Indeed search result pages.

Start by creating a list of 30 major cities across the country for which you want job details. To pull many job ads from a single page, use Indeed's advanced search options.

To display 100 results per page, change limit=50 to limit=100 in the search URL. To page through the results, change start=0 to start=100, start=200, and so on up to start=900, which yields up to 1,000 results for each city. Finally, swap l=Washington+DC in the URL for another city name, such as l=Pittsburgh, and repeat this for all 30 cities on the list.
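The URL changes described above can be sketched in a few lines of Python. This is only an illustration: the base URL, the `q` search parameter, the search term, and the helper name `build_search_urls` are assumptions, while `l`, `limit`, and `start` come from the article.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # assumed search endpoint
QUERY = "data scientist"                  # assumed search term


def build_search_urls(city, limit=100, max_results=1000):
    """Return one search URL per results page for a single city."""
    urls = []
    # start=0, 100, 200, ... 900 gives up to 1,000 results per city
    for start in range(0, max_results, limit):
        params = {"q": QUERY, "l": city, "limit": limit, "start": start}
        # urlencode turns spaces into '+', e.g. l=Washington+DC
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


cities = ["Washington DC", "Pittsburgh"]  # stand-in for the full 30-city list
all_urls = [url for city in cities for url in build_search_urls(city)]
```

Each city produces ten URLs, so the full 30-city list would produce 300 search pages to request.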

To automate this, build a for loop that iterates over each of the 30 cities and then over each search results page, pulling up to 1,000 job ads per city. The first issue that comes up during scraping is that not all Indeed job posts include a salary. To get around this, wrap the salary lookup in a simple try/except statement that returns 'NA' if no salary is specified.
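A minimal sketch of the try/except fallback is shown here. A real scraper would more likely use an HTML parser such as BeautifulSoup on the live pages. This example instead uses the standard library's ElementTree on a small well-formed snippet, and the markup, class names, and function name `extract_salary` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_salary(card):
    """Return the salary text of a job card, or 'NA' if none is listed."""
    try:
        # find() returns None when the element is missing,
        # so accessing .text raises AttributeError
        return card.find(".//span[@class='salary']").text.strip()
    except AttributeError:
        return "NA"


# Two toy job cards; the structure is made up for illustration.
sample = """
<div>
  <div class="job"><h2>Data Scientist</h2><span class="salary">$120,000 a year</span></div>
  <div class="job"><h2>Data Analyst</h2></div>
</div>
"""
cards = ET.fromstring(sample.strip()).findall("div")
salaries = [extract_salary(card) for card in cards]
```

Catching only `AttributeError` keeps other problems, such as a malformed page, visible instead of silently turning them into 'NA'.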

You can run the scraper several times over two or three days to collect as much data as possible. Then load the results into a pandas DataFrame for analysis.

To work with salaries, narrow the results to job ads that include salary information and keep only those quoting a yearly salary. The next step was to calculate the median salary of my results, which was $110,000, and then create a binary variable for each position: 1 if the pay was above the median and 0 if it was below.
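The median split can be sketched as follows. The salary values are toy numbers chosen for illustration, not the scraped data, and assigning a salary exactly equal to the median to class 0 is an assumption, since the article does not say how ties were handled.

```python
from statistics import median

# Toy yearly salaries in USD; illustrative values, not the scraped data.
salaries = [85_000, 95_000, 110_000, 125_000, 140_000]

median_salary = median(salaries)

# 1 = above the median, 0 = otherwise (ties go to 0, which is an assumption)
high_paying = [1 if salary > median_salary else 0 for salary in salaries]
```

The resulting list of 0s and 1s is the target variable a classifier would be trained to predict.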

With a clean, complete DataFrame of roughly 500 records, you can build classification models on several attributes and investigate which factors lead a job to be classified as "high" or "low" paying. For the models, I chose random forest classification because it generally performs well on this kind of task and also estimates how important each variable is to the classification, which is extremely useful for this investigation.

To see how well particular details such as location, job title, and job summary predict the pay class, you will need to develop a separate model for each of them. The location model reached an accuracy of 66%, which is considerably lower than the other three models.

For random forest models, the feature_importances_ attribute returns a value for each feature, here each city, indicating how significant that feature is in the model's prediction process. I used these values to find the locations with the most predictive power and compared each city's median salary with the overall median salary used in the study. The outcomes were not unexpected: the model showed that a job in a large, expensive city usually indicated higher pay. New York, San Jose, San Francisco, Boston, and Philadelphia were among these cities. Smaller and less expensive places, such as St. Louis, MO, Coral Gables, FL, Pittsburgh, PA, Houston, TX, and Austin, TX, often indicated lower pay.

Much better results were obtained using the job title and job summary models. To elaborate on the model-building procedure, I first used a count vectorizer to count how many times each word appeared in each job title. Doing this across all job titles produces a matrix of term-frequency values for all job postings.
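To make the term-frequency matrix concrete, here is a small standard-library stand-in for a count vectorizer. It is not the function used in the study: the name `count_vectorize`, the whitespace tokenization, and the sample titles are all assumptions made for illustration.

```python
from collections import Counter


def count_vectorize(texts):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in text i."""
    # Lowercase and split on whitespace; a real vectorizer tokenizes more carefully.
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical job titles
titles = ["Senior Data Scientist", "Data Analyst", "Machine Learning Engineer"]
vocabulary, matrix = count_vectorize(titles)
```

Each row of `matrix` represents one job title and each column one word in `vocabulary`, which is the shape of input a classifier such as a random forest expects.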

The job summary model was created using the same natural language processing approach. Based on the terms in the job summary, the model correctly identified whether a position was high or low paying roughly 83 percent of the time. The terms machine, learning, data, analytics, engineer, and python were frequently linked to high-paying jobs, whereas health, research, and university were frequently linked to low-paying ones.

For an employer, these findings can help determine how much a job candidate is worth based on the position they are applying for and the skills that position requires. A data scientist with strong Python skills, for example, can be paid more than a data analyst. Also, a company that wants to grow its data science team could consider doing so in a city like St. Louis, MO, or Houston, TX, where data scientists are not paid as much.

There is certainly a big assumption being made in an analysis like this: that the data scientist salaries posted on Indeed.com are representative of all data scientist salaries. This is not a very safe assumption, since most companies do not include salary information in job postings. While this assumption may give an inaccurate estimate of the median salary for a data scientist, it is plausible that the predictions of whether a given job is high or low paying are still valid.

Looking for web scraping services? Contact iWeb Scraping today or request a quote!

