Hi guys, I’ve been asked to fit an ML model to a huge mobility dataset that I have. I tried some models but failed to get a decent [performance metric], so I’d love a fresh set of eyes on this!

Features of the dataset

  • Each row holds the data for one “Origin-Destination” pair. For example, the pair “31 to 493” means the trip from place 31 to place 493. The first feature is thus called pair.
  • For each mode of transport (drive, bike, walk, transit) there are 3 “cost” features, namely [mode]_time, [mode]_cost and [mode]_convenience. So there are 12 of these features in total (4 modes x 3 costs).
  • Some extra features: average_income, cars_per_household, jobs_at_destination (representing the people travelling in this pair).
  • 4 observed features, one for each mode. These are the targets to predict. Each is a value between 0 and 1, representing the share of people in this pair who use that mode of transport.
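To make the layout above concrete, here is a rough sketch of the column names it implies. The cost and extra feature names come straight from the description; the target names (`observed_[mode]`) are only a placeholder I am assuming for illustration, since the real column names are not given here.

```python
MODES = ["drive", "bike", "walk", "transit"]
COSTS = ["time", "cost", "convenience"]

# 4 modes x 3 costs = 12 "cost" features, named [mode]_[cost].
cost_cols = [f"{mode}_{cost}" for mode in MODES for cost in COSTS]

# walk_cost is 0 for every row in this dataset, so it is dropped.
cost_cols_used = [c for c in cost_cols if c != "walk_cost"]

extra_cols = ["average_income", "cars_per_household", "jobs_at_destination"]

# Assumption: hypothetical names for the 4 targets (one share per mode).
target_cols = [f"observed_{mode}" for mode in MODES]

feature_cols = cost_cols_used + extra_cols
```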

Additional information

  • Sometimes the 3 costs for “transit” are 999, meaning that there is no transit option (train, tram, …) available for this pair. The usual costs lie between 0 and 100.
  • I deleted the walk_cost feature because every entry was 0.
  • Here are the distributions of all the features:

https://preview.redd.it/8lt001so6k0c1.png?width=2002&format=png&auto=webp&s=adfe645606746c008941e36fbb35261d0600a8bb

https://preview.redd.it/yeqx4yro6k0c1.png?width=2046&format=png&auto=webp&s=011d05795c203d57d9402805377b0e8741c2378c

https://preview.redd.it/7dtk0gso6k0c1.png?width=2042&format=png&auto=webp&s=6ef5ed2d5f9044dfede43a765ae19afb5c487486

  • And the correlation matrix:

https://preview.redd.it/61uwu9fx6k0c1.png?width=1718&format=png&auto=webp&s=d299ca5ffa97dadd45804a62ed2b17628f51d0f9
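One thing about the 999 values mentioned above: left as they are, they look like extreme costs rather than "no transit". Below is a minimal sketch of one possible way to handle them, assuming each row is available as a plain dict. The helper name `mark_transit_availability` and the `transit_available` flag are made up for this example, and replacing 999 with NaN is just one option (some models cannot take NaN and would need another placeholder).

```python
import math

NO_TRANSIT = 999  # sentinel used in the dataset for "no transit option"
TRANSIT_COLS = ["transit_time", "transit_cost", "transit_convenience"]

def mark_transit_availability(row):
    """Return a copy of `row` with a 0/1 `transit_available` flag.

    Any transit cost equal to the sentinel is replaced by NaN so it is
    not mistaken for a real (very large) cost.
    """
    out = dict(row)  # copy, so the original row is left untouched
    has_sentinel = any(row[c] == NO_TRANSIT for c in TRANSIT_COLS)
    out["transit_available"] = 0 if has_sentinel else 1
    for c in TRANSIT_COLS:
        if row[c] == NO_TRANSIT:
            out[c] = math.nan
    return out

# Usage with a made-up row:
example = {"pair": "31 to 493", "drive_time": 12.5,
           "transit_time": 999, "transit_cost": 999,
           "transit_convenience": 999}
cleaned = mark_transit_availability(example)
```

The same idea carries over to a dataframe: one extra indicator column, plus the sentinel replaced in the three transit columns.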

So the goal is to predict those 4 observed features. I am very curious which ML model you would use for this, and why.

If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share!

Thank you guys!

  • ZijdehoenOPB
    1 year ago

    The pairs are literally just codes. They don’t have any numeric meaning or anything. It just represents a certain trajectory (2 locations)
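Since the codes carry no numeric meaning, a small sketch of how a pair could be split into two categorical labels instead of being fed in as a number. This assumes the pair is stored as text in the "31 to 493" form from the example above; the function name `split_pair` is hypothetical.

```python
def split_pair(pair):
    """Split a pair like "31 to 493" into (origin, destination) labels.

    The codes are kept as strings on purpose: they are identifiers,
    not quantities, so they should not be treated as numbers.
    """
    origin, destination = pair.split(" to ")
    return origin.strip(), destination.strip()

origin, destination = split_pair("31 to 493")
```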