Israeli Used Car Market — Data Analysis

18,954 Listings · 17 Features · 50+ Manufacturers · 18 Visualizations · Python, Pandas, Seaborn

This project is an end-to-end data analysis of the Israeli used car market, based on real listings scraped from the internet. The dataset started as raw, Hebrew-language listings and was cleaned, translated, and explored to uncover patterns in pricing, manufacturer popularity, vehicle features, and key correlations.


Data Cleaning Pipeline


1. Raw Data Loading: Loaded 21,350 rows from a CSV scraped from the internet (Feb 2026). Each row represented a single used car listing with fields including car name (in Hebrew), year, mileage, price, transmission, motor type, color, body type, and more.
2. Translation from Hebrew to English: All categorical text fields were in Hebrew. Custom translator modules were used to convert car names, motor types (e.g. בנזין → Gasoline), ownership types (e.g. פרטית → Private), colors, and body types into English for standardized analysis.
3. Dropping Invalid Rows: Rows where the car name was missing or empty were identified and removed using dropna() on the car column. This eliminated ~2,400 rows where the scraper had returned incomplete records.
4. Removing Unknown Manufacturers: Rows belonging to rare or unrecognized Hebrew manufacturer names (113 rows across ~30 brands) that couldn't be reliably translated were dropped to keep the dataset clean and consistent.
5. Handling Missing Prices: Prices given as the Hebrew placeholder "לא צוין מחיר" (price not specified) were replaced with NaN. Missing prices were then imputed using the mean price per manufacturer + model + year group, preserving realistic price estimates.
6. Fixing Fuel Consumption: The fuel_consumption column contained mixed types and missing values. It was converted to numeric, then missing values were filled using a cascade strategy: (1) mean by manufacturer + year + engine volume, (2) mean by manufacturer + year, (3) mean by manufacturer + motor type, and finally (4) the overall dataset mean. Electric cars were assigned 0 consumption.
7. Fixing Engine Volume: Engine volume was stored as an object type with commas and non-numeric characters. It was extracted using regex, converted to float, and missing values were imputed using the mean per manufacturer + model + year. Electric vehicles were assigned 0.
8. Handling Dates: Both test_date and on_street_date had partial missing values. A cross-fill strategy was applied: if one date was missing, it was inferred from the other. Remaining gaps were filled using the car's model year. Dates were then converted to proper datetime objects.
9. Feature Engineering: Added a z_price column (price z-score for outlier detection), an is_electric binary flag, and a price_category column using quantile-based binning into three tiers: cheap, medium, and expensive.
10. Final Export: The cleaned DataFrame (18,954 rows, 17+ columns, zero nulls) was saved to cleaned_full_cars_data_v4.csv in UTF-8 encoding and used as the input for all exploratory analysis.
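
The cascade fill described in step 6 could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual code: the helper name cascade_fill_mean, the column names manufacturer, year, engine_volume, and motor_type, the demo values, and the choice to zero out electric cars after the cascade are all hypothetical.

```python
import pandas as pd


def cascade_fill_mean(df, target, group_levels):
    """Fill NaNs in `target` with group means, most specific grouping first.

    Hypothetical helper for illustration; not taken from the project's notebooks.
    """
    filled = df[target].copy()
    for keys in group_levels:
        # transform("mean") returns each row's group mean aligned to df's index.
        # Means are computed from the originally observed values only.
        group_mean = df.groupby(keys)[target].transform("mean")
        filled = filled.fillna(group_mean)
    # Final fallback: the overall mean of the observed values.
    return filled.fillna(df[target].mean())


# Made-up demo rows; column names are assumed, not confirmed by the source.
demo = pd.DataFrame({
    "manufacturer": ["Toyota", "Toyota", "Toyota", "Toyota", "Toyota",
                     "Toyota", "Kia", "Mazda", "Tesla"],
    "year": [2020, 2020, 2020, 2020, 2015, 2019, 2018, 2017, 2021],
    "engine_volume": [1600, 1600, 1800, 2000, 1600, 1600, 1400, 2000, 0],
    "motor_type": ["Gasoline"] * 7 + ["Diesel", "Electric"],
    "fuel_consumption": ["6.0", None, "unknown", "8.0", None,
                         "4.0", None, "10.0", None],
})

# Mixed types -> numeric; anything unparseable becomes NaN.
demo["fuel_consumption"] = pd.to_numeric(demo["fuel_consumption"], errors="coerce")

levels = [
    ["manufacturer", "year", "engine_volume"],  # (1) most specific
    ["manufacturer", "year"],                   # (2)
    ["manufacturer", "motor_type"],             # (3)
]
demo["fuel_consumption"] = cascade_fill_mean(demo, "fuel_consumption", levels)

# Assumption: electric cars are overwritten with 0 once the cascade has run.
demo.loc[demo["motor_type"] == "Electric", "fuel_consumption"] = 0.0
```

Computing every group mean from the observed values, rather than from values filled at an earlier level, keeps imputed numbers from feeding into later averages. The same helper would cover the single-level price imputation in step 5 by passing one grouping.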

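Steps 7 and 8 could be sketched as follows. This is only one plausible reading of the description: the function names are hypothetical, commas are assumed to be thousands separators, "inferred from the other" is approximated by copying the other date, dates are assumed to be ISO-formatted strings, and January 1 of the model year is an assumed fallback.

```python
import pandas as pd


def parse_engine_volume(raw):
    """Extract the first number from strings such as '1,598 cc' as floats."""
    # Assumption: commas are thousands separators, so they are simply removed.
    text = raw.astype(str).str.replace(",", "", regex=False)
    digits = text.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    # Rows with no digits (including missing values) become NaN.
    return pd.to_numeric(digits, errors="coerce")


def cross_fill_dates(df):
    """Fill each date column from the other, then fall back to the model year."""
    out = df.copy()
    test = pd.to_datetime(out["test_date"], format="%Y-%m-%d", errors="coerce")
    street = pd.to_datetime(out["on_street_date"], format="%Y-%m-%d", errors="coerce")
    # Assumption: January 1 of the model year stands in when both dates are missing.
    fallback = pd.to_datetime(out["year"].astype(str), format="%Y")
    out["test_date"] = test.fillna(street).fillna(fallback)
    out["on_street_date"] = street.fillna(test).fillna(fallback)
    return out


# Made-up demo values.
volumes = parse_engine_volume(pd.Series(["1,598 cc", "2000", None, "1,998"]))

dates = cross_fill_dates(pd.DataFrame({
    "test_date": ["2024-03-01", None, None],
    "on_street_date": [None, "2018-06-15", None],
    "year": [2020, 2018, 2016],
}))
```

Parsing with errors="coerce" turns malformed values into missing ones, so they flow into the same fill logic as values that were absent from the start.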

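The three engineered columns from step 9 could be produced along these lines. The sketch assumes price and motor_type columns and uses made-up prices; the function name add_features is hypothetical, and the sample standard deviation (pandas' default) is an assumption about how the z-score was computed.

```python
import pandas as pd


def add_features(df):
    """Add z_price, is_electric, and price_category to a copy of `df`."""
    out = df.copy()
    # z-score: distance from the mean price in standard deviations.
    # Series.std() uses the sample standard deviation (ddof=1) by default.
    out["z_price"] = (out["price"] - out["price"].mean()) / out["price"].std()
    # Binary flag: 1 for electric cars, 0 otherwise.
    out["is_electric"] = (out["motor_type"] == "Electric").astype(int)
    # Quantile-based binning into three roughly equal-sized tiers.
    out["price_category"] = pd.qcut(
        out["price"], q=3, labels=["cheap", "medium", "expensive"]
    )
    return out


# Made-up demo rows.
cars = pd.DataFrame({
    "price": [30_000, 50_000, 70_000, 90_000, 110_000, 130_000],
    "motor_type": ["Gasoline", "Electric", "Diesel", "Gasoline", "Hybrid", "Electric"],
})
featured = add_features(cars)
```

Because pd.qcut cuts at quantiles rather than fixed price points, each tier holds about a third of the listings regardless of how skewed the prices are.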
Project Highlights

This project demonstrates a complete data science workflow: from raw web-scraped data with mixed Hebrew/English text, through multi-step cleaning and imputation, to rich exploratory visualizations. It covers distribution analysis, manufacturer comparisons, price modelling, feature correlation, and pairwise relationship analysis — all implemented in Python using Pandas, Matplotlib, and Seaborn.



Analysis & Visualizations


Number of Cars per Year
Fig 1. Distribution of all car listings across model years — showing how many vehicles from each year appear in the dataset.
Top 10 Years by Car Count
Fig 2. The ten most represented model years sorted by frequency. Recent years (2018–2022) dominate the second-hand market.
Top 20 Car Manufacturers
Fig 3. The 20 most listed manufacturers. Toyota, Hyundai, and Kia lead the Israeli used car market by a wide margin.
Feature Correlation Heatmap
Fig 4. A diverging-palette heatmap showing pairwise correlations across all numeric features in the cleaned dataset.
Distribution by Number of Seats
Fig 5. Breakdown of listings by seating capacity. 5-seat vehicles make up the overwhelming majority.
Distribution by Body Type
Fig 6. Vehicle body type breakdown — hatchback, sedan, SUV, and more — across all listings.
Distribution by Color
Fig 7. Color popularity across all listed cars. White, silver, and black are the most common exterior colors.
Distribution by Owning Type
Fig 8. Ownership category proportions: private, leasing, company, and other types.
Distribution by Motor Type
Fig 9. Fuel type breakdown — Gasoline, Diesel, Electric, and Hybrid variants. Gasoline remains dominant.
Distribution by Transmission
Fig 10. Automatic vs. manual transmission split across all listings in the dataset.
Distribution by Hand (Previous Owners)
Fig 11. Number of previous owners per car. First- and second-hand vehicles are the most frequently listed.
Price Distribution — Filtered (98th Percentile)
Fig 13. Price distribution after removing the top 2% of listings as outliers. The distribution is right-skewed, with a peak around ₪60,000–₪120,000.
Number of Cars per Ownership Type
Fig 14. Bar chart of ownership type counts. Private ownership accounts for the vast majority (~17,700 listings).
Price Distribution — Top 10 Manufacturers (Violin Plot)
Fig 15. Violin plot comparing price distributions across the top 10 manufacturers. Luxury brands show wider spreads and higher medians.
Price vs. Year — Mazda
Fig 16. Scatter plot of Mazda listing prices by model year. Newer Mazdas command significantly higher prices.
Annotated Correlation Heatmap
Fig 17. Coolwarm heatmap with correlation coefficients annotated. Year and price show positive correlation; km and price are negatively correlated.
Pairplot — Numeric Features by Price Category
Fig 18. Multi-panel pairplot of key features (year, hand, km, engine volume, fuel consumption) colored by price tier: cheap, medium, and expensive.
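
The 98th-percentile filter behind Fig 13 and the correlation matrix behind Fig 17 can be sketched with pandas alone. The data below is synthetic and built only to mimic the direction of the reported relationships; the column names year, km, and price are assumed.

```python
import numpy as np
import pandas as pd

# Synthetic demo data: newer cars have lower mileage and higher prices.
n = 100
listings = pd.DataFrame({
    "year": np.linspace(2005, 2024, n).round().astype(int),
    "km": np.linspace(250_000, 5_000, n),
    "price": np.arange(1, n + 1) * 1_000.0,
})

# Keep listings at or below the 98th percentile of price (drops the top 2%).
cutoff = listings["price"].quantile(0.98)
filtered = listings[listings["price"] <= cutoff]

# Pairwise Pearson correlations across the numeric columns.
corr = filtered[["year", "km", "price"]].corr()

# With seaborn installed, an annotated heatmap could be drawn from `corr`:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Trimming by quantile rather than by a fixed price keeps the filter meaningful if the dataset is refreshed with new listings.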

For the code and full notebooks, see the project's GitHub repository.