Eng. Ahmad Abdelbaset

Raw Data Loading: Loaded 21,350 rows from a CSV scraped from the internet (Feb 2026). Each row represents a single used car listing, with fields including car name (in Hebrew), year, mileage, price, transmission, motor type, color, body type, and more.

Translation from Hebrew to English: All categorical text fields were in Hebrew. Custom translator modules were used to convert car names, motor types (e.g. בנזין → Gasoline), ownership types (e.g. פרטית → Private), colors, and body types into English for standardized analysis.
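A dictionary lookup is one simple way to do this kind of translation. The sketch below is illustrative only: the column names `motor_type` and `owning_type`, the helper `translate_column`, and the shape of the lookup tables are assumptions. Only the two Hebrew/English pairs are taken from the description above.

```python
import pandas as pd

# Hypothetical lookup tables; only these two pairs come from the write-up.
MOTOR_TYPE_MAP = {"בנזין": "Gasoline"}
OWNING_TYPE_MAP = {"פרטית": "Private"}


def translate_column(series: pd.Series, mapping: dict) -> pd.Series:
    """Map Hebrew labels to English; unmapped values become NaN for later review."""
    return series.str.strip().map(mapping)


df = pd.DataFrame({
    "motor_type": ["בנזין", " בנזין ", "???"],
    "owning_type": ["פרטית", "פרטית", "פרטית"],
})
df["motor_type"] = translate_column(df["motor_type"], MOTOR_TYPE_MAP)
df["owning_type"] = translate_column(df["owning_type"], OWNING_TYPE_MAP)
```

Leaving unmapped labels as NaN (rather than passing the Hebrew through) makes untranslated values easy to count and inspect.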

Dropping Invalid Rows: Rows where the car name was missing or empty were identified and removed using dropna() on the car column. This eliminated ~2,400 rows where the scraper had returned incomplete records.
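A minimal sketch of this step, assuming the column is literally named `car`. Note that `dropna()` removes only NaN/None values, so blank or whitespace-only names have to be converted to NaN first. Whether the original notebook did this conversion is not stated, so treat it as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "car": ["Mazda 3", "", None, "   ", "Kia Rio"],
    "year": [2018, 2019, 2020, 2021, 2017],
})

# dropna() ignores empty strings, so mark blank names as missing first.
blank = df["car"].str.strip().eq("")
df["car"] = df["car"].mask(blank)

rows_before = len(df)
cleaned = df.dropna(subset=["car"]).reset_index(drop=True)
rows_dropped = rows_before - len(cleaned)
```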

Removing Unknown Manufacturers: Rows belonging to rare or unrecognized Hebrew manufacturer names (113 rows across ~30 brands) that couldn't be reliably translated were dropped to keep the dataset clean and consistent.
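One way to express this filter is an allow-list check with `isin`. The column name `manufacturer` and the contents of `KNOWN_MANUFACTURERS` are hypothetical; the real list of recognized brands is not given in the write-up.

```python
import pandas as pd

# Hypothetical allow-list of manufacturers that were translated reliably.
KNOWN_MANUFACTURERS = {"Toyota", "Hyundai", "Kia"}

df = pd.DataFrame({
    # The third value stands in for an untranslated Hebrew leftover.
    "manufacturer": ["Toyota", "Kia", "לא ידוע", "Hyundai"],
})

is_known = df["manufacturer"].isin(KNOWN_MANUFACTURERS)
dropped_count = int((~is_known).sum())  # worth logging before discarding rows
df = df[is_known].reset_index(drop=True)
```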

Handling Missing Prices: Listings with the Hebrew placeholder "לא צוין מחיר" (price not specified) were replaced with NaN. Missing prices were then imputed using the mean price per manufacturer + model + year group, preserving realistic price estimates.
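The placeholder replacement and group-mean imputation could look roughly like the following. The column names (`manufacturer`, `model`, `year`, `price`) are assumed, and the toy values are made up for illustration.

```python
import pandas as pd

PLACEHOLDER = "לא צוין מחיר"  # "price not specified"

df = pd.DataFrame({
    "manufacturer": ["Mazda", "Mazda", "Mazda", "Kia"],
    "model": ["3", "3", "3", "Rio"],
    "year": [2018, 2018, 2018, 2019],
    "price": ["60000", PLACEHOLDER, "80000", "50000"],
})

# Turn the placeholder into a missing value, then convert to numbers.
df["price"] = pd.to_numeric(df["price"].mask(df["price"] == PLACEHOLDER), errors="coerce")

# transform("mean") returns one value per row, aligned with the original index.
group_mean = df.groupby(["manufacturer", "model", "year"])["price"].transform("mean")
df["price"] = df["price"].fillna(group_mean)
```

A group in which every price is missing has no mean to borrow, so its rows stay NaN under this sketch and would need a coarser fallback.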

Fixing Fuel Consumption: The fuel_consumption column contained mixed types and missing values. It was converted to numeric, then missing values were filled using a cascade strategy: (1) mean by manufacturer + year + engine volume, (2) mean by manufacturer + year, (3) mean by manufacturer + motor type, and finally (4) the overall dataset mean. Electric cars were assigned 0 consumption.
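The cascade can be sketched as a loop over progressively coarser groupings. The helper name `cascade_fill`, the column names, and the sample values are assumptions. The write-up also does not say whether electric cars were zeroed before or after imputation; here it is done afterwards so the zeros do not pull down the group means.

```python
import pandas as pd


def cascade_fill(df, target, group_levels):
    """Fill NaNs in `target` with group means, trying each grouping in order."""
    filled = df[target].copy()
    for keys in group_levels:
        # Means are computed from the originally observed values only.
        group_mean = df.groupby(keys)[target].transform("mean")
        filled = filled.fillna(group_mean)
    # Last resort: the overall mean of the observed values.
    return filled.fillna(df[target].mean())


df = pd.DataFrame({
    "manufacturer": ["Kia", "Kia", "Kia", "Kia", "Kia", "Kia", "Mazda", "Toyota", "Tesla"],
    "year": [2019, 2019, 2019, 2019, 2020, 2021, 2018, 2020, 2021],
    "engine_volume": [1400, 1400, 1600, 1200, 1600, 1600, 2000, 1800, 0],
    "motor_type": ["Gasoline"] * 7 + ["Hybrid", "Electric"],
    "fuel_consumption": ["14.0", None, "16.0", None, None, "18.0", "n/a", "20.0", None],
})

# Mixed strings and junk become floats, with NaN for anything unparseable.
df["fuel_consumption"] = pd.to_numeric(df["fuel_consumption"], errors="coerce")

df["fuel_consumption"] = cascade_fill(
    df,
    "fuel_consumption",
    [
        ["manufacturer", "year", "engine_volume"],
        ["manufacturer", "year"],
        ["manufacturer", "motor_type"],
    ],
)

# Assumed ordering: electric cars are set to 0 after the cascade.
df.loc[df["motor_type"] == "Electric", "fuel_consumption"] = 0.0
```

In this toy data each missing row is resolved by a different level: the second Kia by level 1, the 1200 cc Kia by level 2, the 2020 Kia by level 3, and the Mazda by the overall mean.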

Fixing Engine Volume: Engine volume was stored as an object type with commas and non-numeric characters. It was extracted using regex, converted to float, and missing values were imputed using the mean per manufacturer + model + year. Electric vehicles were assigned 0.
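A sketch of the parsing part of this step, assuming raw values that look like the ones below (the exact raw formats are not shown in the write-up). Thousands separators are removed first so that the regex captures the full number.

```python
import pandas as pd

raw = pd.Series(["1,600", "1998 cc", "2,000cm3", None, "unknown"])

# Drop thousands separators, then pull out the first run of digits.
digits = raw.str.replace(",", "", regex=False).str.extract(r"(\d+(?:\.\d+)?)", expand=False)

# Unparseable or missing entries become NaN, ready for group-mean imputation.
engine_volume = pd.to_numeric(digits, errors="coerce").astype(float)
```

The remaining NaN values can then be filled with `groupby([...]).transform("mean")` over manufacturer, model, and year, as described above.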

Handling Dates: Both test_date and on_street_date were partially missing. A cross-fill strategy was applied: if one date was missing, it was inferred from the other. Remaining gaps were filled using the car's model year. Dates were then converted to proper datetime objects.
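The write-up does not spell out how one date was inferred from the other, so the sketch below makes two loudly labelled assumptions: the missing date is simply copied from the available one, and the model-year fallback uses January 1 of that year. The column name `year` is also assumed.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "test_date": ["2025-03-01", None, None],
    "on_street_date": [None, "2019-06-15", None],
})

for col in ["test_date", "on_street_date"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Assumption: cross-fill by copying the other date when one is missing.
test_original = df["test_date"].copy()
df["test_date"] = df["test_date"].fillna(df["on_street_date"])
df["on_street_date"] = df["on_street_date"].fillna(test_original)

# Assumption: rows missing both dates fall back to January 1 of the model year.
year_start = pd.to_datetime(df["year"].astype(str) + "-01-01")
for col in ["test_date", "on_street_date"]:
    df[col] = df[col].fillna(year_start)
```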

Feature Engineering: Added a z_price column (price z-score for outlier detection), an is_electric binary flag, and a price_category column using quantile-based binning into three tiers: cheap, medium, and expensive.
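The three derived columns can be sketched as follows. The input column names and sample prices are illustrative, and the z-score here uses pandas' default sample standard deviation (ddof=1), which may differ from what the original notebook used.

```python
import pandas as pd

df = pd.DataFrame({
    "motor_type": ["Gasoline", "Electric", "Diesel", "Gasoline", "Hybrid", "Gasoline"],
    "price": [30000, 150000, 60000, 45000, 90000, 75000],
})

# Standard score: distance from the mean in units of standard deviation.
df["z_price"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Binary flag: 1 for electric cars, 0 otherwise.
df["is_electric"] = (df["motor_type"] == "Electric").astype(int)

# Quantile-based binning: three tiers with roughly equal numbers of listings.
df["price_category"] = pd.qcut(df["price"], q=3, labels=["cheap", "medium", "expensive"])
```

Because `qcut` splits on quantiles rather than fixed price thresholds, the tier boundaries move with the data and each tier holds about a third of the listings.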

Final Export: The cleaned DataFrame (18,954 rows, 17+ columns, zero nulls) was saved to cleaned_full_cars_data_v4.csv in UTF-8 encoding and used as the input for all exploratory analysis.
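A small sketch of the export with a null check beforehand. The toy frame and the temporary output directory are stand-ins; only the file name and the UTF-8 encoding come from the description above.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"car": ["Mazda 3", "Kia Rio"], "price": [70000, 50000]})

# Fail early if any nulls survived the cleaning steps.
null_count = int(df.isna().sum().sum())
assert null_count == 0, f"{null_count} null values remain"

# A temporary directory keeps this example from writing into the project folder.
out_path = os.path.join(tempfile.mkdtemp(), "cleaned_full_cars_data_v4.csv")
df.to_csv(out_path, index=False, encoding="utf-8")

# Reload to confirm the file round-trips with the same shape.
reloaded = pd.read_csv(out_path, encoding="utf-8")
```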

Number of Cars per Year — **Fig 1.** Distribution of all car listings across model years — showing how many vehicles from each year appear in the dataset.

Top 10 Years by Car Count — **Fig 2.** The ten most represented model years sorted by frequency. Recent years (2018–2022) dominate the second-hand market.

Top 20 Car Manufacturers — **Fig 3.** The 20 most listed manufacturers. Toyota, Hyundai, and Kia lead the Israeli used car market by a wide margin.

Feature Correlation Heatmap — **Fig 4.** A diverging-palette heatmap showing pairwise correlations across all numeric features in the cleaned dataset.

Distribution by Number of Seats — **Fig 5.** Breakdown of listings by seating capacity. 5-seat vehicles make up the overwhelming majority.

Distribution by Body Type — **Fig 6.** Vehicle body type breakdown — hatchback, sedan, SUV, and more — across all listings.

Distribution by Color — **Fig 7.** Color popularity across all listed cars. White, silver, and black are the most common exterior colors.

Distribution by Owning Type — **Fig 8.** Ownership category proportions: private, leasing, company, and other types.

Distribution by Motor Type — **Fig 9.** Fuel type breakdown — Gasoline, Diesel, Electric, and Hybrid variants. Gasoline remains dominant.

Distribution by Transmission — **Fig 10.** Automatic vs. manual transmission split across all listings in the dataset.

Distribution by Hand (Previous Owners) — **Fig 11.** Number of previous owners per car. First- and second-hand vehicles are the most frequently listed.

Price Distribution — Filtered (98th Percentile) — **Fig 13.** Price distribution after removing the top 2% of listings by price. The distribution is right-skewed, with a peak around ₪60,000–₪120,000.
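The 98th-percentile filter behind this figure can be sketched as below, using synthetic prices rather than the real data.

```python
import pandas as pd

# Synthetic prices: 1,000 to 100,000 in steps of 1,000.
prices = pd.Series(range(1, 101)) * 1000

# Keep listings at or below the 98th percentile, dropping the top 2%.
cutoff = prices.quantile(0.98)
filtered = prices[prices <= cutoff]
```

Trimming by percentile rather than by a fixed price cap keeps the filter meaningful if the dataset is re-scraped later and prices shift.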

Number of Cars per Ownership Type — **Fig 14.** Bar chart of ownership type counts. Private ownership accounts for the vast majority (~17,700 listings).

Price Distribution — Top 10 Manufacturers (Violin Plot) — **Fig 15.** Violin plot comparing price distributions across the top 10 manufacturers. Luxury brands show wider spreads and higher medians.

Price vs. Year — Mazda — **Fig 16.** Scatter plot of Mazda listing prices by model year. Newer Mazdas command significantly higher prices.

Annotated Correlation Heatmap — **Fig 17.** Coolwarm heatmap with correlation coefficients annotated. Year and price show positive correlation; km and price are negatively correlated.
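The matrix behind such a heatmap is a pairwise Pearson correlation of the numeric columns. The sketch below uses made-up values chosen only to show the sign pattern described in the caption; it does not reproduce the real coefficients.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2012, 2014, 2016, 2018, 2020, 2022],
    "km": [180000, 150000, 120000, 90000, 50000, 20000],
    "price": [25000, 38000, 52000, 70000, 95000, 130000],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr(numeric_only=True)

year_vs_price = corr.loc["year", "price"]  # positive: newer cars cost more
km_vs_price = corr.loc["km", "price"]      # negative: higher mileage, lower price
```

The resulting `corr` frame can be passed to seaborn's `heatmap` with `annot=True` and `cmap="coolwarm"` to draw an annotated plot in the style the caption describes.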

Pairplot — Numeric Features by Price Category — **Fig 18.** Multi-panel pairplot of key features (year, hand, km, engine volume, fuel consumption) colored by price tier: cheap, medium, and expensive.

Israeli Used Car Market — Data Analysis

Data Cleaning Pipeline

Project Highlights

Analysis & Visualizations

The code and full notebooks are available in the project's GitHub repository.